Micro-architectural Analysis of OLAP: Limitations and Opportunities by Sirin, Utku & Ailamaki, Anastasia
UN
DE
R
SU
BM
IS
SI
ON
Micro-architectural Analysis of OLAP: Limitations and
Opportunities [Experiment and Analysis]
Utku Sirin
EPFL
utku.sirin@epfl.ch
Anastasia Ailamaki
EPFL
anastasia.ailamaki@epfl.ch
ABSTRACT
Understanding micro-architectural behavior is profound in
efficiently using hardware resources. Recent work has shown
that, despite being aggressively optimized for modern hard-
ware, in-memory online transaction processing (OLTP) sys-
tems severely underutilize their core micro-architecture re-
sources [25]. Online analytical processing (OLAP) work-
loads, on the other hand, exhibit a completely different com-
puting pattern. OLAP workloads are read-only, bandwidth-
intensive and include various data access patterns including
both sequential and random data accesses. In addition, with
the rise of column-stores, they run on high performance en-
gines that are tightly optimized for the efficient use of mod-
ern hardware. Hence, the micro-architectural behavior of
modern OLAP systems remains unclear.
This work presents the micro-architectural analysis of a
breadth of OLAP systems. We examine CPU cycles and
memory bandwidth utilization. The results show that, un-
like the traditional, commercial OLTP systems, traditional,
commercial OLAP systems do not suffer from instruction
cache misses. Nevertheless, they suffer from their large in-
struction footprint resulting in slow response times. High
performance OLAP engines execute tight instruction streams;
however, they spend 25 to 82% of the CPU cycles on stalls
regardless the workload being sequential- or random-access-
heavy. In addition, high performance OLAP engines under-
utilize the multi-core CPU or memory bandwidth resources
due to their disproportional compute and memory demands.
Hence, analytical processing engines should carefully assign
their compute and memory resources for efficient multi-core
micro-architectural utilization.
PVLDB Reference Format:
Ben Trovato, G. K. M. Tobin, Lars Thørva¨ld, Lawrence P. Leipuner,
Sean Fogarty, Charles Palmer, John Smith, Julius P. Kumquat,
and Ahmet Sacan. A Sample Proceedings of the VLDB Endow-
ment Paper in LaTeX Format. PVLDB, 12(xxx): xxxx-yyyy,
2019.
DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx
This work is licensed under the Creative Commons Attribution-
NonCommercial-NoDerivatives 4.0 International License. To view a copy
of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For
any use beyond those covered by this license, obtain permission by emailing
info@vldb.org. Copyright is held by the owner/author(s). Publication rights
licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 12, No. xxx
ISSN 2150-8097.
DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx
1. INTRODUCTION
Online analytical processing (OLAP) is an ever-growing,
multi-billion dollar industry. Many industrial and commu-
nity organizations rely on fast and efficient analytical pro-
cessing to extract valuable information from their data. Un-
derstanding micro-architectural behavior of OLAP systems,
on the other hand, is profound in providing high perfor-
mance. Micro-architectural behavior reveals the limitations
and opportunities in efficiently using modern hardware re-
sources, and hence allows delivering high performance. Re-
search has shown that OLAP systems can improve perfor-
mance orders of magnitude by more efficiently using the
modern hardware resources [20].
Micro-architectural behavior of online transaction process-
ing (OLTP) systems has been investigated extensively. Re-
cent work has shown that, despite being aggressively op-
timized for modern hardware, in-memory OLTP systems
severely underutilize their core micro-architecture resources
[25]. OLAP workloads, on the other hand, exhibit a com-
pletely different computing pattern. Unlike the update-
heavy OLTP workloads, OLAP workloads are read-only.
Therefore, they require neither a concurrency control and
logging mechanism nor a complex buffer pool for synchro-
nizing the modified pages with disk. Moreover, OLAP work-
loads are arithmetic-operation- and bandwidth-intensive. Th-
ey process large amounts of data with various data access
patterns including both sequential and random data ac-
cesses.
In addition, with the rise of column-stores [1, 8], a di-
verse set of OLAP execution models, e.g., vectorized [5] and
compiled execution [16], and system prototypes, e.g., Pro-
teus [15] and Typer and Tectorwise [17] have been proposed.
Most major database systems such as SQL Server, Oracle,
and DB2 now support a column-store extension [18, 19, 24].
Column-stores operate only on the columns that are nec-
essary for the query, and hence utilize memory bandwidth
more efficiently. Moreover, they process columns in tight,
hardware-friendly execution loops optimized for the efficient
use of the CPU cycles.
Following a read-only, arithmetic-operation- and bandwid-
th-intensive computing pattern, and running on systems us-
ing various execution models, micro-architectural behavior
of modern OLAP systems remains unclear. In this pa-
per, we examine the hardware behavior of a breadth of
OLAP systems from different categories of systems and ex-
ecution models. We profile a traditional, commercial row-
store, DBMS R, a column-store extension of a traditional,
commercial row-store, DBMS C, an open source, high per-
1
ar
X
iv
:1
90
8.
04
71
8v
1 
 [c
s.D
B]
  1
3 A
ug
 20
19
UN
DE
R
SU
BM
IS
SI
ON
formance OLAP engine following a compiled execution model,
Typer [17], and an open source, high performance OLAP
engine following a vectorized execution model, Tectorwise
[17]. We examine both CPU cycles and memory bandwidth
utilization. We examine how well each OLAP system uti-
lizes the hardware resources and what the limitations and
opportunities are. Our paper demonstrates the following:
• Unlike the traditional, commercial OLTP systems, the
traditional, commercial OLAP system and its column-
store extension do not suffer from instruction cache
misses. Nevertheless, they suffer from their large in-
struction footprints, which results in orders of magni-
tude lower performance than high performance OLAP
engines.
• High performance OLAP engines execute a tight in-
struction stream; however, they dramatically suffer
from high stall cycles ratio regardless the workload
being sequential- or random-data-access-heavy. Work-
loads with large sequential scans stress the memory
subsystem. Hence, the hardware prefetchers fall be-
hind, and 50 to 75% of the CPU cycles are spent on
stalls. Workloads with many random data accesses
suffer from long-latency data stalls, and hence spend
25 to 82% of the CPU cycles on stalls.
• High performance OLAP engines underutilize multi-
core CPU or memory bandwidth resources due to their
disproportional compute and memory demands. Work-
loads with large sequential scans quickly saturate the
multi-core memory bandwidth underutilizing multi-core
CPU resources. Workloads with many random ac-
cesses saturate the multi-core CPU resources before
saturating the multi-core memory resources, leaving
the multi-core memory bandwidth underutilized. Hence,
analytical processing engines should carefully assign
their compute and memory resources for efficient multi-
core micro-architectural utilization.
The rest of the paper is organized as follows. Section
2 presents the experimental setup and methodology. Sec-
tion 3, 4 and 5 present the projection, selection and join
micro-benchmark analyses. Section 6 presents the analy-
sis of TPC-H queries. Section 7, 8, 9 and 10 present the
analyses of predication, SIMD, hardware prefetchers, and
multi-core execution. Lastly, Section 11 and 12 present the
related work and conclusions.
2. SETUP &METHODOLOGY
This section presents our experimental setup and method-
ology.
Benchmarks: We use micro-benchmarks and a subset of
TPC-H queries [31]. We use projection, selection and join
micro-benchmarks as they constitute the basic SQL opera-
tors. All the systems use hash join algorithm when running
the join micro-benchmark. We also performed a group by
micro-benchmark, however, we observe that it behaves sim-
ilarly to the join at the micro-architectural level, and hence,
omitted the discussion on them.
All the micro-benchmarks use the TPC-H schema. The
projection micro-benchmark does a single SUM() over a set
of columns from the lineitem table. We vary the num-
ber of columns from one to four. We use l extendedprice,
Table 1: Broadwell server parameters.
Processor
Intel(R) Xeon(R) CPU
E5-2680 v4 (Broadwell)
#sockets 2
#cores per socket 14
Hyper-threading Off
Turbo-boost Off
Clock speed 2.40GHz
Per-core bandwidth
12GB/s (sequential)
7GB/s (random)
Per-socket bandwidth
66GB/s (sequential)
60GB/s (random)
L1I / L1D (per core)
32KB / 32KB
16-cycle miss latency
L2 (per core)
256KB
26-cycle miss latency
L3 (shared)
(inclusive) 35MB
160-cycle miss latency
Memory 256GB
l discount, l tax and l quantity columns. We add the pro-
jected columns inside the SUM(). We call the projection
micro-benchmark doing a SUM() over n columns as a pro-
jection query with degree of n.
The selection micro-benchmark extends the projection qu-
ery with degree of four with a WHERE clause of three pred-
icates over three columns of the lineitem table: l shipdate,
l commitdate and l receiptdate. It varies the selectivity of
each individual predicate from 10% to 50% and 90%. The
join micro-benchmark does a join over two tables followed
by a projection. The small-sized join micro-benchmark joins
supplier and nation tables over the nationkey attribute, and
does a SUM() over the addition of s acctbal and s suppkey.
The medium-sized join joins partsupplier and supplier ta-
bles over the supplierkey attribute, and does a SUM() over
the addition of ps availqty and ps supplycost. The large-
sized join joins lineitem and orders table over the orderkey
attribute, and does a SUM() over the addition of the four
columns that the projection query with degree of four uses.
We also profile four TPC-H queries: Q1, Q6, Q9 and Q18.
We chose these queries as each represents a particular cat-
egory. Q1 is a low-cardinality group by (4 groups), Q6 is a
highly selective filter, Q9 is a join-intensive query and Q18
is high-cardinality group by (1.5 million groups).
Hardware: We conduct our experiments on an Intel Broad-
well server. Table 1 presents the server parameters. As the
Broadwell micro-architecture does not support AVX-512 in-
structions, we conduct the SIMD experiments on a separate
Skylake server. The Skylake server has a similar execution
engine but a different memory hierarchy from the Broadwell
server. The Skylake server has a significantly larger L2 cache
(1 MB), a smaller non-inclusive L3 cache (16MB), a smaller
per-core (10 GB/s) and a larger per-socket (87 GB/s) se-
quential access bandwidth. It has a similar per-core and
per-socket random access bandwidth.
We use Intel’s Memory Latency Checker (MLC) [10] to
measure cache access latencies and maximum single- and
multi-core bandwidths.
OLAP systems: We examine a commercial row-store, DBMS
R, the column-store extension of the row-store, DBMS C,
an open-source OLAP engine implementing a compiled ex-
2
UN
DE
R
SU
BM
IS
SI
ON
ecution engine, Typer [17], and an open-source OLAP en-
gine implementing a vectorized execution engine, Tectorwise
[17]. We chose these systems as each represents a different
category of a system and execution model. As commercial
systems are closed-source we do a best-effort explanation of
their behavior.
OS & Compiler: We use Ubuntu 16.04.6 LTS and gcc
5.4.0 on the Broadwell server, and Ubuntu 18.04.2 LTS and
gcc 7.4.0 on the Skylake server.
VTune: We use Intel VTune 2018 on the Broadwell server,
and VTune 2019 on the Skylake server. We use VTune’s
built-in general-exploration (uarch-exploration on VTune 20-
19) analysis type for CPU cycles breakdown. VTune’s general-
exploration provide full CPU cycles breakdown [26, 32]. We
examine CPU cycles at two-levels. We firstly break down
the CPU cycles into Retiring and Stall cycles. Retiring cy-
cles represent the percentage of the useful cycles spent on
retiring instructions. Stall cycles represent the percentage
of the wasted cycles spent on stalls. Secondly, we zoom
into the Stall cycles breakdown composed of five compo-
nents: (i) Branch mispredictions, (ii) Icache, (iii) Decoding,
(iv) Dcache and (v) Execution. Branch mispredictions rep-
resent stalls due to the branch mispredictions. Icache rep-
resents Icache misses stalls. Decoding represents the stalls
due to the inefficiencies in the instruction decoding micro-
architecture. Dcache represents the stalls due to memory
hierarchy. Lastly, Execution represents the stalls due to the
saturation of the core execution resources.
We use VTune’s built-in memory-access analysis type to
measure the used memory bandwidth. As we numa-localize
our experiments on a single socket, we report average band-
width per-socket values.
Measurements: For every experiment, we firstly populate
the database. We use one minute of warmup period followed
by three minutes of VTune profiling period. We disable
Hyper-threading and Turbo-boost as they jeopardize VTune
counter values [11].
We numa-localize every experiment by using Linux’s nu-
mactl command. Except the multi-core execution experi-
ments presented in Section 10, we run all the experiments
on a single core over a TPC-H database with scaling factor
of five, i.e., a database of 5GB. We chose scaling factor of
five as it is large enough to run out-of-cache experiments.
We run the multi-core experiments on a single socket, over
a TPC-H database with scaling factor of 70, i.e., database
of 70GB.
Hardware prefetchers experiments in Section 9 are done
by modifying the relevant model-specific register (msr) of
the processor [9].
3. PROJECTION
This section presents the micro-architectural analysis of
the projection micro-benchmark. Our goal is to observe
how the micro-architectural behavior changes as the projec-
tivity increases. Figure 1 shows the CPU cycles breakdown
for DBMS R and C. The figure shows that while DBMS R
spends about half of the CPU cycles for Retiring, DBMS C
spends almost 90% of the CPU cycles for Retiring.
Figure 2 shows the stall cycles breakdown. As the figure
shows, DBMS R spends the majority of the stall cycles on
Dcache and Execution stalls. This shows that, unlike the
traditional, commercial OLTP systems spending the major-
ity of the CPU cycles to Icache stalls, traditional, commer-
1	
Projectivity 
0%	
20%	
40%	
60%	
80%	
100%	
p1	 p2	 p3	 p4	 p1	 p2	 p3	 p4	
DBMS	R	 DBMS	C	
CP
U
	c
yc
le
s	(
%
)	
Projectivity	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
p1	 p2	 p3	 p4	 p1	 p2	 p3	 p4	
DBMS	R	 DBMS	C	
St
al
l	c
yc
le
s	(
%
)	
Projectivity	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
0%	
20%	
40%	
60%	
80%	
100%	
p1	 p2	 p3	 p4	 p1	 p2	 p3	 p4	
Typer	 Tectorwise	
CP
U
	c
yc
le
s	(
%
)	
Projectivity	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
p1	 p2	 p3	 p4	 p1	 p2	 p3	 p4	
Typer	 Tectorwise	
St
al
l	c
yc
le
s	(
%
)	
Projectivity	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
Figure 1: CPU cycles breakdown for projection as
projectivity increases for DBMS R and DBMS C.
1	
Projectivity 
0%	
20%	
40%	
60%	
80%	
100%	
p1	 p2	 p3	 p4	 p1	 p2	 p3	 p4	
DBMS	R	 DBMS	C	
CP
U
	c
yc
le
s	(
%
)	
Projectivity	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
p1	 p2	 p3	 p4	 p1	 p2	 p3	 p4	
DBMS	R	 DBMS	C	
St
al
l	c
yc
le
s	(
%
)	
Projectivity	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
0%	
20%	
40%	
60%	
80%	
100%	
p1	 p2	 p3	 p4	 p1	 p2	 p3	 p4	
Typer	 Tectorwise	
CP
U
	c
yc
le
s	(
%
)	
Projectivity	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
p1	 p2	 p3	 p4	 p1	 p2	 p3	 p4	
Typer	 Tectorwise	
St
al
l	c
yc
le
s	(
%
)	
Projectivity	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
Figure 2: Stall cycles breakdown for projection as
projectivity increases for DBMS R.
cial OLAP systems do not suffer significantly from Icache
stalls. On the other hand, DBMS C spends the major-
ity of the stall cycles on Branch mispredictions and Icache
stalls. However, its overall stall cycles ratio is less than
10%. Hence, none of the systems suffer significantly from
the Icache stalls.
Figure 3 shows the CPU cycles breakdown for Typer and
Tectorwise. As can be seen, both Typer and Tectorwise
spend the majority of CPU cycles on stalls. While Typer’s
stall cycles ratio increases from 60% to 75% as projectivity
increases, Tectorwise’s stall cycles ratio remains 60% as the
projectivity increases.
Figure 4 shows the stall cycles breakdown for Typer and
Tectorwise. We observe that the stall cycles of Typer are
mainly due to Dcache stalls. Moreover, as the projectiv-
ity increases, Dcache stall cycles ratio also increases. On
the other hand, the stall cycles breakdown of Tectorwise re-
mains the same as the projectivity increases: Dcache and
Execution stalls contribute equally to the stall cycles.
The reason for the increasing Dcache stalls of Typer is
the saturation of the single-core sequential memory band-
width, which results in a super-linear increase in the Dcache
stalls. The reason for the stable stall cycles breakdown of
Tectorwise is the vectorized execution model. Vectorized ex-
ecution model adds two vectors and materializes the result
in another intermediate vector, which will later be used to
3
UN
DE
R
SU
BM
IS
SI
ON
1	
Projectivity 
0%	
20%	
40%	
60%	
80%	
100%	
p1	 p2	 p3	 p4	 p1	 p2	 p3	 p4	
DBMS	R	 DBMS	C	
CP
U
	c
yc
le
s	(
%
)	
Projectivity	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
p1	 p2	 p3	 p4	 p1	 p2	 p3	 p4	
DBMS	R	 DBMS	C	
St
al
l	c
yc
le
s	(
%
)	
Projectivity	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
0%	
20%	
40%	
60%	
80%	
100%	
p1	 p2	 p3	 p4	 p1	 p2	 p3	 p4	
Typer	 Tectorwise	
CP
U
	c
yc
le
s	(
%
)	
Projectivity	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
p1	 p2	 p3	 p4	 p1	 p2	 p3	 p4	
Typer	 Tectorwise	
St
al
l	c
yc
le
s	(
%
)	
Projectivity	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
Figure 3: CPU cycles breakdown for projection as
projectivity increases for Typer and Tectorwise.
1	
Projectivity 
0%	
20%	
40%	
60%	
80%	
100%	
p1	 p2	 p3	 p4	 p1	 p2	 p3	 p4	
DBMS	R	 DBMS	C	
CP
U
	c
yc
le
s	(
%
)	
Projectivity	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
p1	 p2	 p3	 p4	 p1	 p2	 p3	 p4	
DBMS	R	 DBMS	C	
St
al
l	c
yc
le
s	(
%
)	
Projectivity	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
0%	
20%	
40%	
60%	
80%	
100%	
p1	 p2	 p3	 p4	 p1	 p2	 p3	 p4	
Typer	 Tectorwise	
CP
U
	c
yc
le
s	(
%
)	
Projectivity	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
p1	 p2	 p3	 p4	 p1	 p2	 p3	 p4	
Typer	 Tectorwise	
St
al
l	c
yc
le
s	(
%
)	
Projectivity	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
Figure 4: Stall cycles breakdown for projection as
projectivity increases for Typer and Tectorwise.
add the other columns in the aggregation. Hence, at the
degree of two and onwards, the processor is subject to the
same computing pattern: two vectors being added and the
result being written to a third vector. As a result, the stall
cycles breakdown remains stable despite the increasing pro-
jectivity.
Figure 5 presents the memory bandwidth utilization of
Typer and Tectorwise. As the figure shows, Typer almost
saturates the single-core memory bandwidth at the projec-
tivity of degree two and onwards due to the high pressure on
the memory subsystem. Tectorwise, on the other hand, has
low memory bandwidth utilization due to materialization
overheads. The materialization overheads cut the memory
pressure of Tectorwise.
The projection query is a simple sequential sum over a set
of columns with a highly predictable data access pattern. As
a result, hardware prefetchers are highly likely to predict the
future data blocks. Despite that, Dcache stalls constitute a
large portion of the execution time even when the mem-
ory bandwidth is not saturated. This shows that today’s
hardware prefetchers are not fast enough for sequential-scan-
heavy analytical workloads. We examine hardware prefetch-
ers behavior in more detail in Section 9.
In addition, today’s power-hungry server processors pro-
vide wide execution engines. Our Broadwell server has eight
execution ports, four of them including an ALU unit [12].
Projectivity - Bandwidth 
0	
2	
4	
6	
8	
10	
12	
p1	 p2	 p3	 p4	 p1	 p2	 p3	 p4	
Typer	 Tectorwise	
Ba
nd
w
id
th
	(G
B/
s)
	
Projectivity	
MAX	
Figure 5: Single-core sequential access bandwidth
utilization for Typer and Tectorwise when running
projection as projectivity increases.
Projectivity – exec. time 
0	
5000	
10000	
1500 	
20000	
DBMS	R	 DBMS	C	 Typer	 Tectorwise	
Re
sp
on
se
	ti
m
e	
(m
s)
	
Stall	 Retiring	
0	
50	
100	
150	
200	
250	
DBMS	R	 DBMS	C	 Typer	 Tectorwise	
N
or
m
al
ize
d	
re
sp
on
se
	ti
m
e	
Stall	
Retiring	
Figure 6: Normalized (with respect to Typer) re-
sponse time breakdown for the projection query
with degree of 4 across DBMSs.
Despite that, Typer at the projectivity of degree 1, and Tec-
torwise at all the projectivities significantly suffer from Ex-
ecution stalls. This shows that, despite being wide, todays
processors fall short on providing enough execution units for
the arithmetic-heavy analytical workloads.
Figure 6 presents the normalized response time breakdown
for the four systems when running the projection micro-
benchmark with degree of four. As the figure shows, DBMS
R is two orders of magnitude slower than Typer and Tector-
wise. While DBMS C’s optimizations make it an order of
magnitude faster than DBMS R, it is nevertheless an order
of magnitude slower than Typer and Tectorwise. Moreover,
both DBMS R and C have high Retiring cycles ratio. As the
number of Retiring cycles is correlated to the number of re-
tired instructions, high Retiring cycles show that DBMS R
and C severely suffer from their large instruction footprints.
Note that, despite that Typer and Tectorwise are open-
source OLAP prototype engines, their performance is rep-
resentative for fully implemented high-performant systems
such as HyPer and VectorWise. Kersten et al. [17] have
compared the performance of Typer and Tectorwise with
real-life systems HyPer and VectorWise, and have shown
that Typer and Tectorwise are only slightly faster than Hy-
Per and VectorWise.
4
UN
DE
R
SU
BM
IS
SI
ON
Selectivity 
0%	
20%	
40%	
60%	
80%	
100%	
10%	 50%	 90%	 10%	 50%	 90%	
DBMS	R	 DBMS	C	
CP
U
	c
yc
le
s	(
%
)	
Selectivity	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
10%	 50%	 90%	 10%	 50%	 90%	
Typer	 Tectorwise	
St
al
l	c
yc
le
s	(
%
)	
Selectivity	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
0%	
20%	
40%	
60%	
80%	
100%	
10%	 50%	 90%	 10%	 50%	 90%	
Typer	 Tectorwise	
CP
U
	c
yc
le
s	(
%
)	
Selectivity	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
10%	 50%	 90%	 10%	 50%	 90%	
DBMS	R	 DBMS	C	
St
al
l	c
yc
le
s	(
%
)	
Selectivity	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
Figure 7: CPU cycles breakdown for selection as
selectivity increases for DBMS R and DBMS C.
Selectivity 
0%	
20%	
40%	
60%	
80%	
100%	
10%	 50%	 90%	 10%	 50%	 90%	
DBMS	R	 DBMS	C	
CP
U
	c
yc
le
s	(
%
)	
Selectivity	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
10%	 50%	 90%	 10%	 50%	 90%	
Typer	 Tectorwise	
St
al
l	c
yc
le
s	(
%
)	
Selectivity	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
0%	
20%	
40%	
60%	
80%	
100%	
10%	 50%	 90%	 10%	 50%	 90%	
Typer	 Tectorwise	
CP
U
	c
yc
le
s	(
%
)	
Selectivity	
Stall	 Retiring	
0%	
20%	
40%	
60 	
80 	
100%	
10%	 50%	 90%	 10%	 50%	 90%	
DBMS	R	 DBMS	C	
St
al
l	c
yc
le
s	(
%
)	
Selectivity	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
Figure 8: Stall cycles breakdown for selection as se-
lectivity increases for DBMS R and DBMS C.
4. SELECTION
Having examined the projection, we now move to examin-
ing the selection micro-benchmark. Our goal is to examine
how influential the branch mispredictions stalls are on the
micro-architectural behavior. Figure 7 shows the CPU cy-
cles breakdown for DBMS R and C. We observe that the Re-
tiring cycles ratio increases as the selectivity increases both
for DBMS R and C. It is because the higher the amount of
computation is (due to the higher selectivity), the more the
instruction overheads consume the CPU cycles.
Figure 8 shows the stall cycles breakdown. We observe
that DBMS R does not suffer from Icache stalls signifi-
cantly. Moreover, DBMS C suffers from the Decoding stalls
only at the higher selectivities, where the overall stall cy-
cles ratio is low. Hence, none of the commercial systems
majorly suffer from the instruction-related stalls. Neverthe-
less, DBMS R and C are 1.6x to 40x slower than the high
performance OLAP engines proving their large instruction
footprints (graph not shown).
Figure 9 shows the CPU cycles breakdown for Typer and
Tectorwise. Unlike DBMS R and C, Typer and Tectorwise
have the highest stall cycles ratio at 50%. Figure 10 shows
the stall cycles breakdown. The figure shows that branch
mispredictions dominate the stall cycles. In addition, the
highest branch misprediction stalls are at the 50% selectiv-
ity. The reason is that the prediction task is the hardest
Selectivity 
0%	
20%	
40%	
60%	
80%	
100%	
10%	 50%	 90%	 10%	 50%	 90%	
DBMS	R	 DBMS	C	
CP
U
	c
yc
le
s	(
%
)	
Selectivity	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
10%	 50%	 90%	 10%	 50%	 90%	
Typer	 Tectorwise	
St
al
l	c
yc
le
s	(
%
)	
Selectivity	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
0%	
20%	
40%	
60%	
80%	
100%	
10%	 50%	 90%	 10%	 50%	 90%	
Typer	 Tectorwise	
CP
U
	c
yc
le
s	(
%
)	
Selectivity	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
10%	 50%	 90%	 10%	 50%	 90%	
DBMS	R	 DBMS	C	
St
al
l	c
yc
le
s	(
%
)	
Selectivity	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
Figure 9: CPU cycles breakdown for selection as
selectivity increases for Typer and Tectorwise.
Selectivity 
0%	
20%	
40%	
60%	
80%	
100%	
10%	 50%	 90%	 10%	 50%	 90%	
DBMS	R	 DBMS	C	
CP
U
	c
yc
le
s	(
%
)	
Selectivity	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
10%	 50%	 90%	 10%	 50%	 90%	
Typer	 Tectorwise	
St
al
l	c
yc
le
s	(
%
)	
Selectivity	
Execution	 Dcache	 Dec ding	 Icache	 Branch	misp.	
0%	
20%	
40%	
60%	
80%	
100%	
10%	 50%	 90%	 10%	 50%	 90%	
Typer	 Tectorwise	
CP
U
	c
yc
le
s	(
%
)	
Selectivity	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
10%	 50%	 90%	 10%	 50%	 90%	
DBMS	R	 DBMS	C	
St
al
l	c
yc
le
s	(
%
)	
Selectivity	
Execution	 Dcache	 Dec ding	 Icache	 Branch	misp.	
Figure 10: Stall cycles breakdown for selection as
selectivity increases for Typer and Tectorwise.
at the 50% selectivity, corroborating the existing work by
Sompolski et al. [27].
Typer suffers less from the branch mispredictions at the
10% selectivity compared to Tectorwise. This is because
Tectorwise is a vectorized engine, and thus individually eval-
uates every predicate. As a result, the branch predictor al-
ways faces the 10% individual selectivity of each predicate.
Typer, however, is a compiled engine, and thus evaluates all
the predicates at once. Hence, the branch predictor faces a
much lower overall selectivity (10% × 10% × 10% = 0.01%),
making the prediction task easier for the branch predictor.
We also examine the memory bandwidth utilization for
the selection micro-benchmark (graph not shown). We ob-
serve that memory bandwidth utilization is well below the
maximum bandwidth both for Typer and Tectorwise. Typer
uses, 3, 5 and 5 GB/s, whereas Tectorwise uses 2.5, 3 and
3 GB/s of memory bandwidth for selectivities of 10%, 50%
and 90%, respectively. This is because of the branch mispre-
diction stalls preventing the core creating enough memory
traffic to saturate bandwidth.
Overall, the experiments in this section show that branch
mispredictions constitute a significant performance bottle-
neck for high performance OLAP systems. We examine
predication in Section 7 to see how micro-architectural be-
havior changes when branch mispredictions are eliminated.
5
UN
DE
R
SU
BM
IS
SI
ON
Join 
0%	
20%	
40%	
60%	
80%	
100%	
Sm.	 Md.	 Lr.	 Sm.	 Md.	 Lr.	
DBMS	R	 DBMS	C	
CP
U
	c
yc
le
s	(
%
)	
Join	size	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
Sm.	 Md.	 Lr.	 Sm.	 Md.	 Lr.	
DBMS	R	 DBMS	C	
St
al
l	c
yc
le
s	(
%
)	
Join	size	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
0%	
20%	
40%	
60%	
80%	
100%	
Sm.	 Md.	 Lr.	 Sm.	 Md.	 Lr.	
Typer	 Tectorwise	
CP
U
	c
yc
le
s	(
%
)	
Join	size	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
Sm.	 Md.	 Lr.	 Sm.	 Md.	 Lr.	
Typer	 Tectorwise	
St
al
l	c
yc
le
s	(
%
)	
Join	size	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
Figure 11: CPU cycles breakdown for join for
DBMS R and C. Sm., Md. and Lr. stand for Small,
Medium and Large, respectively.
Join 
0%	
20%	
40%	
60%	
80%	
100%	
Sm.	 Md.	 Lr.	 Sm.	 Md.	 Lr.	
DBMS	R	 DBMS	C	
CP
U
	c
yc
le
s	(
%
)	
Join	size	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
Sm.	 Md.	 Lr.	 Sm.	 Md.	 Lr.	
DBMS	R	 DBMS	C	
St
al
l	c
yc
le
s	(
%
)	
Join	size	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
0%	
20%	
40%	
60%	
80%	
100%	
Sm.	 Md.	 Lr.	 Sm.	 Md.	 Lr.	
Typer	 Tectorwise	
CP
U
	c
yc
le
s	(
%
)	
Join	size	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
Sm.	 Md.	 Lr.	 Sm.	 Md.	 Lr.	
Typer	 Tectorwise	
St
al
l	c
yc
le
s	(
%
)	
Join	size	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
Figure 12: CPU cycles breakdown for join for Typer
and Tectorwise. Sm., Md. and Lr. stand for Small,
Medium and Large, respectively
5. JOIN
In this section, we examine the join micro-benchmark. We
force all the systems to use hash join algorithm. Unlike the
selection and projection micro-benchmarks with a sequential
data access pattern, hash join includes many random data
accesses. Our goal is to understand the effect of random
data accesses in the overall micro-architectural behavior.
Figure 11 shows the CPU cycles breakdown for DBMS R
and C. As can be seen, DBMS R and C spend 52 to 72%
of the CPU cycles for retiring instructions for all the join
sizes. Observing a similar CPU cycles breakdown despite
the orders of magnitude increase in the joined table sizes
indicates the large instruction footprint of DBMS R and C
that overshadows the micro-architectural behavior.
Figure 12 and 13 show the CPU and stall cycles break-
down for Typer and Tectorwise. The CPU cycles breakdown
shows that the stall cycles ratio increases as the join size in-
creases. The Retiring cycles ratio can get as low as 18% for
the large-sized join operation. The stall cycles breakdown
shows that Dcache stalls become more and more dominant
as the join size increases. As expected, the larger the joined
table sizes are, the larger the number of random data ac-
cesses are, which results in an increased stall cycles ratio.
On the other hand, for the small- and medium-sized joins,
Join 
0%	
20%	
40%	
60%	
80%	
100%	
Sm.	 Md.	 Lr.	 Sm.	 Md.	 Lr.	
DBMS	R	 DBMS	C	
CP
U
	c
yc
le
s	(
%
)	
Joins	size	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
Sm.	 Md.	 Lr.	 Sm.	 Md.	 Lr.	
DBMS	R	 DBMS	C	
St
al
l	c
yc
le
s	(
%
)	
Join	size	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
0%	
20%	
40%	
60%	
80%	
100%	
Sm.	 Md.	 Lr.	 Sm.	 Md.	 Lr.	
Typer	 Tectorwise	
CP
U
	c
yc
le
s	(
%
)	
Joins	size	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
Sm.	 Md.	 Lr.	 Sm.	 Md.	 Lr.	
Typer	 Tectorwise	
St
al
l	c
yc
le
s	(
%
)	
Join	size
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
Figure 13: Stall cycles breakdown for join for Typer
and Tectorwise. Sm., Md. and Lr. stand for Small,
Medium and Large, respectively
Join – exec time & b/w 
0	
2000	
4000	
6000	
8000	
10000	
R	 C	 Ty.	 Tw.	
Re
sp
on
se
	ti
m
e	
(m
s)
	 Stall	
Retiring	
0	
1	
2	
3	
4	
5	
6	
7	
Ty.	 Tw.	
Ba
nd
w
id
th
	(G
B/
s)
	
MAX	
0	
1	
2	
3	
4	
5	
6	
7	
R	 C	 Ty.	 Tw.	
N
or
m
al
ize
d	
re
sp
on
se
	ti
m
e	 Stall	
Retiring	
Figure 14: Left: Single-core random access band-
width utilization when running the large-sized join.
Right: Normalized (with respect to Typer) response
time breakdown for the large-sized join. R, C, Ty.
and Tw. stand for DBMS R, DBMS C, Typer and
Tectorwise.
the Execution stalls constitute a significant portion of the
stall cycles highlighting costly hash computations.
Figure 14 (left) shows the memory bandwidth utilization
when running the large-sized join micro-benchmark. As the
figure shows, the bandwidth utilizations are well-below the
maximum random access bandwidth. This shows that Typer
and Tectorwise do not create enough memory traffic to use
the available single-core random access bandwidth. This
finding corroborates with the existing work on using corou-
tines to improve hash join performance [13, 21]. Coroutines
allow overlapping long-latency memory stalls with compu-
tation enabling a more efficient utilization of the memory
bandwidth. Psaropoulos et al. [22, 23] has shown that
memory bandwidth starts getting saturated with 28 cores
(per-socket) and upwards.
Figure 14 (right) presents the normalized response time
breakdowns. As the figure shows, DBMS R and C are 4.5x
and 6.3x slower than Typer and Tectorwise by spending or-
ders of magnitude more time on retiring instructions. Hence,
DBMS R and C suffers from orders of magnitude larger in-
struction footprints compared to Typer and Tectorwise.
6
UN
DE
R
SU
BM
IS
SI
ON
6. TPC-H
Up to now, we have examined simple micro-benchmarks.
In this section, we analyze four TPC-H queries: Q1, Q6
and Q9, Q18, each of which represents a particular class
of queres: (i) Q1 is a low-cardinality group by (4 groups),
(ii) Q6 is a highly selective filter, (iii) Q9 is a join-intensive
query and (iv) Q18 is a high-cardinality group by (1.5 mil-
lion groups). We, once again, observed orders of magnitude
difference in the response times of the commercial and high
performance systems. Hence, we omit the discussion on the
commercial systems, and focus on the two high performance
systems we profile.
Figure 15 shows the CPU cycles breakdown. As can be
seen, both Typer and Tectorwise have the highest Retiring
cycles ratio when running Q1. Whereas Typer has the lowest
Retiring cycles ratio when running Q9, and Tectorwise has
the lowest Retiring cycles ratio when running Q6.
Figure 16 shows the stall cycles breakdown. The figure
shows that the stall cycles breakdowns are dominated by
the Execution stalls when running Q1. This is similar to
the small-sized hash join micro-benchmark. Q1 has a small
hash table keeping the small number of group aggregations.
Hence, its main working set fits into the cache, and Dcache
stalls are mostly eliminated. This time, however, the Exe-
cution stalls surface up causing the processor to stall ∼40%.
Figure 16 shows that Q6 is dominated by Dcache stalls
for Typer, but Branch mispredictions for Tectorwise. The
overall selectivity of Q6 is ∼2%. However, it includes five
individual predicates with varying selectivities. As being a
compiled engine, Typer only experiences the 2% overall se-
lectivity, whereas Tectorwise, as being a vectorized engine,
evaluates every predicate individually. Hence, the processor
experiences the selectivities of the individual predicates that
vary from 16% to 48%, causing much more branch mispre-
dictions.
Figure 16 shows that Dcache stalls are the dominant factor
in the overall stall cycles both for Typer and Tectorwise
when running Q9. This is expected as Q9 is join-intensive,
and hence makes many random data accesses. However,
unlike the simple join micro-benchmark, Q9 also incurs a
significant number of Branch misprediction stalls. This is
because Q9’s successive hash joins require successive hash
table probings whose outcome is less and less predictable
as the processing moves upwards in the query plan. As a
result, the branch predictor fails significant number of times
resulting in as high Branch misprediction stalls as ∼38%.
Figure 16 shows that the stall cycles breakdown of Q18 is
similar to that of Q9, with less Dcache stalls. This is because
Q18 is a high-cardinality group by (1.5 million groups),
which requires successive hashing of multiple attributes. As
the number of groups is large, the hash table keeping the
group aggregations does not fit into the cache. Hence, the
group aggregation updates result in Dcache stalls. Q18 also
significantly suffers from Branch mispredictions and Exe-
cution stalls. The Branch mispredictions stalls are due to
the frequent hash collusions during the aggregation updates.
The Execution stalls are due to the successive hash compu-
tations for grouping the group by attributes.
Hash tables built for group by operator cause more hash
collusions than the hash tables built for hash join. The
reason is that different groups that share a common group
attribute have a higher chance of colluding than the evenly
distributed primary/foreign keys. Our analysis of a group
tpch 
0%	
20%	
40%	
60%	
80%	
100%	
Q1	 Q6	 Q9	 Q18	 Q1	 Q6	 Q9	 Q18	
DBMS	R	 DBMS	C	
CP
U
	c
yc
le
s	(
%
)	
Query	Id	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
Q1	 Q6	 Q9	 Q18	 Q1	 Q6	 Q9	 Q18	
DBMS	R	 DBMS	C	
St
al
l	c
yc
le
s	(
%
)	
Query	Id	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
0%	
20%	
40%	
60%	
80%	
100%	
Q1	 Q6	 Q9	 Q18	 Q1	 Q6	 Q9	 Q18	
Typer	 Tectorwise	
CP
U
	c
yc
le
s	(
%
)	
Query	Id	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
Q1	 Q6	 Q9	 Q18	 Q1	 Q6	 Q9	 Q18	
Typer	 Tectorwise	
St
al
l	c
yc
le
s	(
%
)	
Query	Id	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
Figure 15: CPU cycles breakdown for TPC-H
queries for Typer and Tectorwise.
tpch 
0%	
20%	
40%	
60%	
80%	
100%	
Q1	 Q6	 Q9	 Q18	 Q1	 Q6	 Q9	 Q18	
DBMS	R	 DBMS	C	
CP
U
	c
yc
le
s	(
%
)	
Query	Id	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
Q1	 Q6	 Q9	 Q18	 Q1	 Q6	 Q9	 Q18	
DBMS	R	 DBMS	C	
St
al
l	c
yc
le
s	(
%
)	
Query	Id	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
0%	
20%	
40%	
60%	
80%	
100%	
Q1	 Q6	 Q9	 Q18	 Q1	 Q6	 Q9	 Q18	
Typer	 Tectorwise	
CP
U
	c
yc
le
s	(
%
)	
Que y	Id	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
Q1	 Q6	 Q9	 Q18	 Q1	 Q6	 Q9	 Q18	
Typer	 Tectorwise	
St
al
l	c
yc
le
s	(
%
)	
Query	Id	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
Figure 16: Stall cycles breakdown for TPC-H
queries for Typer and Tectorwise.
by and hash join micro-benchmark showed that the lengths
of the hash table chains vary from 0 to 7 with an average
chain length of 0.23 and standard deviation of 0.5 for the
group by, whereas vary from 0 to 1 with an average chain
length of 0.44 and standard deviation of 0.49 for the hash
join micro-benchmark. Hence, the hash table built for the
group by micro-benchmark is much more irregular than that
of the hash join, causing more hash collusions.
We also examined the memory bandwidth utilization for
TPC-H queries (graph not shown). We observed that the
used bandwidth is always less than 1 GB/s for all the queries
and for both Typer and Tectorwise, except Typer when run-
ning Q6. Q6 running on Typer has 4.7 GB/s bandwidth uti-
lization. The reason is that Q6 is a low-selectivity selection
query making sequential scans to evaluate the predicates.
All the other queries suffer from costly hash computations
preventing the system creating high memory pressure.
Overall, the experiments in this section show that the ma-
jor micro-architectural bottlenecks that the micro-benchma-
rks identified generalizes to complex TPC-H queries. Hence,
we can evaluate micro-architectural behavior of a given query
by examining its individual operators, and based on that, we
can identify the limitations and opportunities in using the
micro-architectural resources more efficiently.
7
UN
DE
R
SU
BM
IS
SI
ON
predication 
0	
50	
100	
150	
200	
250	
Br.	 Br.-free	 Br.	 Br.-free	 Br.	 Br.-free	
10%	 50%	 90%	
Re
sp
on
se
	ti
m
e	
(m
s)
	
Selectivity	
Stall	 Retiring	
0	
40	
80	
120	
160	
Br.	 Br.-free	 Br.	 Br.-free	 Br.	 Br.-free	
10%	 50%	 90%	
St
al
l	t
im
e	
(m
s)
	
Selectivity	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
0	
50	
100	
150	
200	
250	
300	
Br.	 Br.-free	 Br.	 Br.-free	 Br.	 Br.-free	
10%	 50%	 90%	
Re
sp
on
se
	ti
m
e	
(m
s)
	
Selectivity	
Stall	 Retiring	
0	
50	
100	
150	
200	
Br.	 Br.-free	 Br.	 Br.-free	 Br.	 Br.-free	
10%	 50%	 90%	
St
al
l	t
im
e	
(m
s)
	
Selectivity	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
Figure 17: Response time breakdown for Typer
when running branched and branch-free selection
query.
predication 
0	
50	
100	
150	
200	
250	
Br.	 Br.-free	 Br.	 Br.-free	 Br.	 Br.-free	
10%	 50%	 90%	
Re
sp
on
se
	ti
m
e	
(m
s)
	
Selectivity	
Stall	 Retiring	
0	
40	
80	
120	
160	
Br.	 Br.-free	 Br.	 Br.-free	 Br.	 Br.-free	
10%	 50%	 90%	
St
al
l	t
im
e	
(m
s)
	
Selectivity	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
0	
50	
100	
150	
200	
250	
300	
Br.	 Br.-free	 Br.	 Br.-free	 Br.	 Br.-free	
10%	 50%	 90%	
Re
sp
on
se
	ti
m
e	
(m
s)
	
Selectivity	
Stall	 Retiring	
0	
50	
100	
150	
200	
Br.	 Br.-free	 Br.	 Br.-free	 Br.	 Br.-free	
10%	 50%	 90%	
St
al
l	t
im
e	
(m
s)
	
Selectivity	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
Figure 18: Stall time breakdown for Typer when
running branched and branch-free selection query.
7. PREDICATION
In this section, we examine the predication optimization.
Predication is used to eliminate branches. Its idea is to con-
vert control dependencies to data dependencies by comput-
ing the predicate as an arithmetic expression, and using it
to increment the index/aggregation. The trade-off is doing
more computation but avoid branches. Our goal is to exam-
ine how predication changes the micro-architectural behav-
ior. We examine predication for Typ r and Tectorwis .
Figure 17 shows the response time breakdown for Typer.
While Br. on the x-axis refers to the branched version, Br.-
free refers to the predicated, branch-free version. The figure
shows that predication increases the response time for 10%
selectivity, whereas decreases the response time for 50% and
90% selectivities. Figure 18 shows the stall time breakdown
for Typer. The figure shows that predication successfully
eliminates the branch mispredictions, and makes the selec-
tion query Dcache- and Execution-stalls-bound similar to
the projection query.
Figure 19 and 20 shows the response and stall time break-
down for Tectorwise. As the figures show predication always
decreases the response time for all the selectivities, and elim-
inates the branch misprediction stalls. In addition, predica-
tion makes the selection query Dcache- and Execution-stalls-
bound similar to the projection query.
predication 
0	
50	
100	
150	
200	
250	
Br.	 Br.-free	 Br.	 Br.-free	 Br.	 Br.-free	
10%	 50%	 90%	
Re
sp
on
se
	ti
m
e	
(m
s)
	
Selectivity	
Stall	 Retiring	
0	
40	
80	
120	
160	
Br.	 Br.-free	 Br.	 Br.-free	 Br.	 Br.-free	
10%	 50%	 90%	
St
al
l	t
im
e	
(m
s)
	
Selectivity	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
0	
50	
100	
150	
200	
250	
300	
Br.	 Br.-free	 B .	 Br.-free	 B .	 Br.-free	
10%	 50%	 90%	
Re
sp
on
se
	ti
m
e	
(m
s)
	
Selectivity	
Stall	 Retiring	
0	
50	
100	
150	
200	
Br.	 Br.-free	 Br.	 Br.-free	 Br.	 Br.-free	
10%	 50%	 90%	
St
al
l	t
im
e	
(m
s)
	
Selectivity	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
Figure 19: Response time breakdown for Tector-
wise when running branched and branch-free selec-
tion query.
predication 
0	
50	
100	
150	
200	
250	
Br.	 Br.-free	 Br.	 Br.-free	 Br.	 Br.-free	
10%	 50%	 90%	
Re
sp
on
se
	ti
m
e	
(m
s)
	
Selectivity	
Stall	 Retiring	
0	
40	
80	
120	
160	
Br.	 Br.-free	 Br.	 Br.-free	 Br.	 Br.-free	
10%	 50%	 90%	
St
al
l	t
im
e	
(m
s)
	
Selectivity	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
0	
50	
100	
150	
200	
250	
300	
Br.	 Br.-free	 Br.	 Br.-free	 Br.	 Br.-free	
10%	 50%	 90%	
Re
sp
on
se
	ti
m
e	
(m
s)
	
Selectivity	
Stall	 Retiring	
0	
50	
100	
150	
200	
Br.	 Br.-free	 Br.	 Br.-free	 Br.	 Br.-free	
10%	 50%	 90%	
St
al
l	t
im
e	
(m
s)
	
Selectivity	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
Figure 20: Stall time breakdown for Tectorwise
when running branched and branch-free selection
query.
The reason for predication to be more successful on Tec-
torwise than Typer is that Tectorwise relies on intermediate
selection vectors to implement the selection operator. As
a result, the only additional computation that predication
brings is to compute more during the computation of the
selection vectors, whereas the bulk of the projection com-
putation remains the same. Whereas for Typer, predication
requires computing the projection for the entire table for all
the selectivities, which pays off for 50% and 90%, but does
not pay off for 10% selectivity.
We also profiled predicated TPC-H, Q6 on Typer and Tec-
torwise, and we reached similar conclusions. While Typer’s
execution time is decreased by 11%, Tectoriwise’s execution
time is decreased by 52% thanks to predication.
Figure 21 shows the memory bandwidth utilization. As
can be seen, Typer’s memory bandwidth utilization is high
similar to the projection query, and remains stable across the
selectivities. This is because Typer’s predicated selection
queries only involve scanning the predicated and the pro-
jected columns and making arithmetic operations on them.
On the other hand, Tectorwise’s bandwidth utilization is
lower than that of Typer. This is because of the material-
ization overheads. In addition, the bandwidth utilization of
Tectorwise varies across the selectivities. The reason for the
8
UN
DE
R
SU
BM
IS
SI
ON

	–

 
!
#
$

 
( "( %( ( "( %(
 

 	
 
 
 
 
 
 
  



Figure 21: Single-core bandwidth utilization for
Typer and Tectorwise when running the predicated
selection queries.
highest bandwidth utilization at 50% is that the data access
pattern is the most confusing for the hardware prefetchers at
the 50% selectivity. As a result, hardware prefetchers create
unnecessary memory traffic resulting in higher bandwidth
consumption. The bandwidth utilization of 90% selectiv-
ity is higher than that of 10% selectivity as more data is
processed at the 90% selectivity.
When not using predication, the bandwidth utilization
is 3, 5 and 5 GB/s for Typer and 2.5, 3 and 3 GB/s for
Tectorwise for 10%, 50% and 90% selectivities, respectively
(graph not shown). Hence, predication significantly im-
proves the bandwidth utilization both for Typer and Tector-
wise, though less for Tectorwise due to the materialization
overheads.
We also profiled the bandwidth utilization of the predi-
cated Q6. The results showed that the bandwidth utiliza-
tion of Typer increases from 4.7 to 6.9 GB/s, whereas the
bandwidth utilization of Tectorwise increases from 1 to 4.7
GB/s. Hence, predication significantly increases the mem-
ory bandwidth utilization for TPC-H, Q6, too.
Overall, the experiments in this section show that pred-
icated selection queries behave similarly to the projection
queries at the micro-architectural level. While Dcache and
Execution stalls are the main stall cycle bottlenecks, Typer
stresses the memory bandwidth close to the maximum, where-
as Tectorwise underutilizes the memory bandwidth due to
the materialization overheads.
8. SIMD
The second optimization we examine is SIMD. SIMD in-
structions are used to reduce the number of instructions re-
quired to perform arithmetic operations. We test Tector-
wise when running the projection, selection and join micro-
benchmarks with and without using the SIMD instructions.
As our Broadwell server does not support AVX-512 instruc-
tions, we do all the SIMD experiments on a Skylake server
supporting AVX-512 instructions.
Note that the Skylake server that we profile in this section
has a different memory hierarchy with a different maximum
memory bandwidth than that of the Broadwell server. As
a result, the reported values that do not use SIMD do not
exactly match with the values reported earlier in the paper
(see Section 2, Hardware subsection for more details).
8.1 Projection & Selection
In this section, we examine the projection and selection
micro-benchmarks. We use the predicated, branch-free ver-
sions of the selection queries as SIMD is more effective when
branch mispredictions are eliminated. Figure 22 shows the
normalized response time breakdown, where the response
time without SIMD is taken as the base.
simd 
0	
0.2	
0.4	
0.6	
0.8	
1	
W/o	
SIMD	
W/	
SIMD	
W/o	
SIMD	
W/	
SIMD	
W/o	
SIMD	
W/	
SIMD	
Small	 Medium	 Large	
N
or
m
al
ize
d	
re
sp
on
se
	ti
m
e	
Join	size	
Stall	 Retiring	
0	
0.2	
0.4	
0.6	
0.8	
1	
1.2	
1.4	
1.6	
W/o	
SIMD	
W/	
SIMD	
W/o	
SIMD	
W/	
SIMD	
W/o	
SIMD	
W/	
SIMD	
Small	 Medium	 Large	
N
or
m
al
ize
d	
st
al
l	t
im
e	
Join	size	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
0	
0.2	
0.4	
0.6	
0.8	
1	
W/o	
SIMD	
W/	
SIMD	
W/o	
SIMD	
W/	
SIMD	
W/o	
SIMD	
W/	
SIMD	
W/o	
SIMD	
W/	
SIMD	
Proj.	 Sel.	10%	 Sel.	50%	 Sel.	90%	
N
or
m
al
ize
d	
re
sp
on
se
	ti
m
e	 Stall	 Retiring	
0	
0.2	
0.4	
0.6	
0.8	
1	
1.2	
1.4	
1.6	
W/o	
SIMD	
W/	
SIMD	
W/o	
SIMD	
W/	
SIMD	
W/o	
SIMD	
W/	
SIMD	
W/o	
SIMD	
W/	
SIMD	
Proj.	 Sel.	10%	 Sel.	50%	 Sel.	90%	
N
or
m
al
ize
d	
st
al
l	t
im
e	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
Figure 22: Normalized response time breakdown for
Tectorwise when running projection and selection
queries with and without SIMD.
simd 
0	
0.2	
0.4	
0.6	
0.8	
1	
W/o	
SIMD	
W/	
SIMD	
W/o	
SIMD	
W/	
SIMD	
W/o	
SIMD	
W/	
SIMD	
Small	 Medium	 Large	
N
or
m
al
ize
d	
re
sp
on
se
	ti
m
e	
Join	size	
Stall	 Retiring	
0	
0.2	
0.4	
0.6	
0.8	
1	
1.2	
1.4	
1.6	
W/o	
SIMD	
W/	
SIMD	
W/o	
SIMD	
W/	
SIMD	
W/o	
SIMD	
W/	
SIMD	
Small	 Medium	 Large	
N
or
m
al
ize
d	
st
al
l	t
im
e	
Join	size	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
0	
0.2	
0.4	
0.6	
0.8	
1	
W/o	
SIMD	
W/	
SIMD	
W/o	
SIMD	
W/	
SIMD	
W/o	
SIMD	
W/	
SIMD	
W/o	
SIMD	
W/	
SIMD	
Proj.	 Sel.	10%	 Sel.	50%	 Sel.	90%	
N
or
m
al
ize
d	
re
sp
on
se
	ti
m
e	 Stall	 Retiring	
0	
0.2	
0.4	
0.6	
0.8	
1	
1.2	
1.4	
1.6	
W/o	
SIMD	
W/	
SIMD	
W/o	
SIMD	
W/	
SIMD	
W/o	
SIMD	
W/	
SIMD	
W/o	
SIMD	
W/	
SIMD	
Proj.	 Sel.	10%	 Sel.	50%	 Sel.	90%	
N
or
m
al
ize
d	
st
al
l	t
im
e	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
Figure 23: Normalized stall time breakdown for
Tectorwise when running projection and selection
queries with and without SIMD.
Figure 22 shows that SIMD reduces the response time by
22% for projection, and 42%, 23% and 21% for selection
for 10%, 50% and 90% selectivities. For all the four cases,
the figure shows that there is 70% to 87% decrease in the
amount of time spent for Retiring cycles. As Retiring cycles
are correlated to the number of retired instructions, reduced
Retiring cycles shows that SIMD successfully reduces the
number of retired instructions.
Figure 23 shows the normalized stall time breakdown,
where the stall time without using SIMD is taken as the
base. The figure shows that SIMD increases Dcache stalls
while reducing the Execution stalls. The decrease in the
Execution stalls is due to using the SIMD execution engine
to perform the arithmetic operations rather than the regu-
lar arithmetic logic units. The increase in the Dcache stalls
is because SIMD instructions stress the memory bandwidth
9
UN
DE
R
SU
BM
IS
SI
ON
more, which results in waiting more on Dcache stalls. There-
fore, SIMD instructions make the projection and predicated
selection queries more Dcache-stalls-bound than Execution-
stalls-bound.
Simd – b/w 
0	
2	
4	
6	
8	
10	
W/o	
SIMD	
W/	
SIMD	
W/o	
SIMD	
W/	
SIMD	
W/o	
SIMD	
W/	
SIMD	
W/o	
SIMD	
W/	
SIMD	
Proj.	 Sel.	10%	 Sel.	50%	 Sel.	90%	
Ba
nd
w
id
th
	(G
B/
s)
	 MAX	
0	
1	
2	
3	
4	
5	
6	
7	
W/o	SIMD	 W/	SIMD	
Ba
nd
w
id
th
	(G
B/
s)
	 MAX	
Figure 24: Single-core bandwidth utilization for
Tectorwise when running projection and selection
queries with and without SIMD.
Figure 24 shows the single-core bandwidth utilization for
the projection and selection micro-benchmarks. We observe
that SIMD significantly increases the bandwidth utilization
for almost all the cases. This shows that SIMD effectively
exploits the underutilized bandwidth.
SIMD does not significantly improve the bandwidth uti-
lization at 50% selectivity. This is because the data access
pattern is the most confusing for the hardware prefetchers
at the 50% selectivity. As a result, the consumed bandwidth
is already high when not using SIMD leaving only a small
room for SIMD to exploit. Overall, the experiments in this
section show that a vectorized engine, despite suffering from
vectorization overheads, can significantly stress the memory
bandwidth by using SIMD instructions.
8.2 Join
In this section, we examine the large-sized join micro-
benchmark. We compare only the hash table probing phases.
Figure 25 (left) shows the normalized response time break-
down, where the response time without SIMD is taken as
the base. Note that the response time breakdown includes
the stall time breakdown inside on the same graph.
Figure 25 (left) shows that SIMD reduces the response
time by 27%. The reason is both the reduced number of re-
tired instructions and the reduced Dcache stalls. We exam-
ine the memory bandwidth utilization in Figure 25 (right).
The figure shows that SIMD improves the utilization by
50%. The reduced data stalls and increased bandwidth uti-
lization show that SIMD effectively parallelizes the random
accesses of hash table probings.
9. PREFETCHERS
Section 3 and 7 have shown that the projection and pred-
icated selection queries suffer from Dcache stalls. Both the
projection and predicated selection queries are essentially
sequential scans of the relevant columns with a highly pre-
dictable data access pattern. Despite that, large Dcache
stalls raise the question how useful hardware prefetchers are.
Simd – large join only 
0.00	
0.20	
0.40	
0.60	
0.80	
1.00	
W/o	SIMD	 W/	SIMD	N
or
m
al
ize
d	
re
sp
on
se
	ti
m
e	
Execution	 Dcache	 Decoding	
Icache	 Branch	mis .	 Retiring	
0.0	
0.2	
0.4	
0.6	
0.8	
1.0	
W/o	SIMD	 W/	SIMD	
N
or
m
al
ize
d	
re
sp
on
se
	ti
m
e	
Execution	 Dcache	
Decoding	 Icache	
Branch	misp.	 Retiring	
0	
1	
2	
3	
4	
5	
6	
7	
W/o	SIMD	 W/	SIMD	
Ba
nd
w
id
th
	(G
B/
s)
	
MAX	
Figure 25: Left: Normalized response time break-
down for Tectorwise when running the large-sized
join query. Right: Single-core bandwidth utiliza-
tion for Tectorwise when running the large-sized join
query.
In this section, we study the four hardware prefetchers
that today’s server processors provide: L1 next line (L1
NL), L1 streamer (L1 Str.), L2 next line (L2 NL) and L2
streamer (L2 Str.) prefetchers. We turn on and off the
four hardware prefetchers and examine their effects on the
micro-architectural behavior. We examine the following six
configurations: (i) all hardware prefetchers are disabled, (ii)
only L1 NL enabled, (iii) only L1 Str. enabled, (iv) only L2
NL enabled, (v) only L2 Str. enabled, and (vi) all hardware
prefetchers are enabled (which is the default case for all the
experiments in the rest of the paper).
Prefetcher – yper – projection - combined 
0	
50	
100	
150	
200	
250	
300	
350	
All	
disabled	
L1	NL	 L1	Str.	 L2	NL	 L2	Str.	 All	
enabled	
Re
sp
on
se
	ti
m
e	
(m
s)
	
Execution	 Dcache	 Decoding	
Icache	 Branch	misp.	 Retiring	
Figure 26: Response time breakdown for the six
prefetcher configuration for Typer when running the
projection micro-benchmark with degree of four.
Figure 26 shows the response time breakdown for the six
prefetcher configurations we examine for Typer when run-
ning the projection query with degree of four. Observe that
the response time breakdown includes the stall time break-
down inside on the same graph.
The figure shows that prefetchers reduce the Dcache stalls
by 85%, and the response time by 73%. Moreover, L2
streamer is the most effective hardware prefetcher as it alone
reduces the response time and the stalls as much as all the
four prefetchers together do.
10
UN
DE
R
SU
BM
IS
SI
ON
We also examined the projection query on Tectoriwse, and
the branched and branch-free selection queries on Typer and
Tectorwise. The results agree with our findings for the pro-
jection query on Typer. Hence, we omit the graphs for these
experiments. We also examined the join micro-benchmark.
As expected, prefetchers are in general not so useful for the
join queries due to the large number of random data ac-
cesses. Prefetchers reduce the response time by ∼ 20% for
the large-sized join both for Typer and Tectorwise. We omit
the graphs for these queries.
Overall, the experiments in this section show that hard-
ware prefetchers are indeed effective in reducing Dcache
stalls for large-sequential-scan workloads. However, they are
nevertheless not fast enough, which results in 50 to 75% of
the CPU cycles spent on stalls.
10. MULTI-CORE EXECUTION
We lastly examine the hardware utilization for multi-core
execution. Most OLAP operations scale well across multi-
cores. As a result, we do not expect a big difference in
the micro-architectural behavior of the multi-core execu-
tion compared to the single-core execution. We use the four
TPC-H queries as they are more complex than the micro-
benchmark queries, and hence harder to scale. We profile
the systems at fourteen threads, i.e., at the number of cores
per socket, as they all have the highest performance at four-
teen threads.
Figure 27 and 28 show the CPU and stall cycles break-
downs. As can be seen, both CPU and stall cycles break-
downs are similar to the single-core breakdowns. While the
low-cardinality group by Q1 has the highest Retiring cy-
cles ratio both for Typer and Tectorwise, join-intensive Q9
has the lowest Retiring cycles ratio for Typer, and highly
selective filter Q6 has the lowest Retiring cycles ratio for
Tectorwise. While Execution stalls dominate Q1, Dcache
stalls dominate the rest of the queries for Typer and Tec-
torwise, except that Q6 is Branch misprediction dominated
for Tectorwise.
While multi-core execution does not create a significant
difference in the core micro-architectural behavior, it cre-
ates an increasing pressure on the multi-core memory band-
width. In Figure 29 and 30, we examine the bandwidth
utilization for the projection query with degree of four and
for the large-sized join query. We measure per-socket aver-
age memory bandwidth utilization. We choose the micro-
benchmarks as they are easier to scale, and hence, stress the
memory bandwidth more.
Figure 29 shows that Typer saturates the multi-core band-
width at eight cores. This shows that Typer’s performance is
bandwidth-limited after eight cores. Hence, using more than
eight cores for Typer when running the projection query
would waste the cores. Tectorwise saturates the bandwidth
at twelve cores. This is because Tectorwise underutilizes the
per-core bandwidth due to the materialization overheads.
At twelve cores, however, it reaches the maximum per-socket
memory bandwidth. Hence, using more than twelve cores
would waste the cores when Tectorwise running the projec-
tion query.
Figure 30 presents the bandwidth utilization when run-
ning the large-sized join query. As the figure shows, both
Typer and Tectorwise largely underutilize the memory band-
width. This is because costly hash computations are pre-
venting the system to create enough memory traffic. As a re-
Multi-core, tpch 
0%	
20%	
40%	
60%	
80%	
100%	
Q1	 Q6	 Q9	 Q18	 Q1	 Q6	 Q9	 Q18	
Typer	 Tectorwise	
CP
U
	c
yc
le
s	(
%
)	
Query	Id	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
Q1	 Q6	 Q9	 Q18	 Q1	 Q6	 Q9	 Q18	
Typer	 Tectorwise	
St
al
l	c
yc
le
s	(
%
)	
Query	Id	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
Figure 27: CPU cycles breakdown for multi-core ex-
ecution when running TPC-H queries on Typer and
Tectorwise.
Multi-cor , tp h 
0%	
20%	
40%	
60%	
80%	
100%	
Q1	 Q6	 Q9	 Q18	 Q1	 Q6	 Q9	 Q18	
Typer	 Tectorwise	
CP
U
	c
yc
le
s	(
%
)	
Query	Id	
Stall	 Retiring	
0%	
20%	
40%	
60%	
80%	
100%	
Q1	 Q6	 Q9	 Q18	 Q1	 Q6	 Q9	 Q18	
Typer	 Tectorwise	
St
al
l	c
yc
le
s	(
%
)	
Query	Id	
Execution	 Dcache	 Decoding	 Icache	 Branch	misp.	
Figure 28: Stall cycles breakdown for multi-core ex-
ecution when running TPC-H queries on Typer and
Tectorwise.
sult, Typer and Tectorwise saturates the multi-core CPU re-
sources before saturating the multi-core memory resources,
leaving the multi-core memory bandwidth underutilized.
We also profiled the bandwidth utilization of the TPC-
H queries (graph not shown). The results showed that the
bandwidth utilization varies between the high utilization of
the projection and the low utilization of the join micro-
benchmarks. While the predicated Q6 comes close to the
maximum sequential bandwidth (56 GB/s both for Typer
and Tectorwise), Q1, Q9 and Q18 exhibits a similar band-
width utilization to the join micro-benchmark due to their
low pressure on the memory bandwidth.
Overall, the experiments in this section show that the dis-
proportional compute and memory demands of the OLAP
systems result in underutilization of either the compute or
memory resources. Hence, analytical processing systems
should carefully schedule their compute and memory re-
sources to efficiently use the multi-core micro-architectural
resources.
The memory pressure of Tectorwise can be increased by
using SIMD when running the join query. If we apply the
SIMD improvement in Section 8, the multi-core bandwidth
utilization of Tectorwise would increase from 21 to 31.5
GB/s. Similarly, the bandwidth utilization of Typer and
11
UN
DE
R
SU
BM
IS
SI
ON
Multicore – b/w 
0	
10	
20	
30	
40	
50	
60	
70	
1	 4	 8	 12	 14	
Ba
nd
w
id
th
	(G
B/
s)
	
Number	of	threads	
Typer	
Tectorwise	
0	
10	
20	
30	
40	
50	
60	
70	
1	 4	 8	 12	 14	
Ba
nd
w
id
th
	(G
B/
s)
	
Number	of	threads	
Typer	
Tectorwise	
0	
10	
20	
30	
40	
50	
60	
1	 4	 8	 12	 14	
Ba
nd
w
id
th
	(G
B/
s)
	
Number	of	threads	
Typer	
Tectorwise	
MAX	 MAX	
MAX	
Figure 29: Multi-core bandwidth utilization for
Typer and Tectorwise when running the projection
query with degree of four.
Tectorwise can be improved by using hyper-threading. Our
analysis with hyper-threading showed that the bandwidth
utilization is improved by 1.3x both for Typer and Tec-
torwise. Hence, Tectorwise’s (together with SIMD) and
Typer’s bandwidth utilizations would raise up to 40 GB/s
and 27 GB/s when hyper-threading is enabled. While the
improvements are substantial, they are nevertheless below
the maximum random access bandwidth. Moreover, our
main finding of underutilization of compute or memory re-
sources in the face of disproportional demands still holds for
many scenarios.
11. RELATEDWORK
There is a large body of work on the micro-architectural
analysis of database workloads. Ailamaki et al. [2] and
Hardavellas et al. [7] present database workload charac-
terization both for analytical and transactional workloads.
Tozun et al. [29, 30] presents micro-architectural analysis of
disk-based OLTP systems. Sirin et al. [25] presents micro-
architectural analysis of a breadth of OLTP systems. Our
work complements all these studies by presenting an anal-
ysis of modern analytical processing systems on a modern
processor.
Kersten et al. [17] presents an analysis of vectorized and
compiled OLAP engines without getting deep into micro-
architectural analysis. Sompolski et al. [27] presents a com-
parison between vectorized and compiled engines in terms
of particular optimizations such as predication and SIMD.
Our work comprehensively extends and complements these
works focusing on a deep analysis of single- and multi-core
micro-architectural behavior of a breadth of OLAP systems.
Ferdman et al. [6] presents micro-architectural analysis
of a suite of cloud workloads, concluding that there is a
fundamental mismatch among what today’s server proces-
sors provide and what cloud workloads demand. Our work
agrees with this work, and extends its conclusions to modern
analytical processing systems.
Yasin et al. [32] introduces the Top-Down Micro-architec-
ture Analysis Methodology (TMAM), that is adopted by
Intel VTune as general exploration. Sirin et al. [26] presents
an improvement on Yasin et al. [32]’s methodology, which
is adopted by Intel VTune in version 2018 and onwards.
Multicore – b/w 
0	
10	
20	
30	
40	
50	
60	
70	
1	 4	 8	 12	 14	
Ba
nd
w
id
th
	(G
B/
s)
	
Number	of	threads	
Typer	
Tectorwise	
0	
10	
20	
30	
40	
50	
60	
70	
1	 4	 8	 12	 14	
Ba
nd
w
id
th
	(G
B/
s)
	
Number	of	threads	
Typer	
Tectorwise	
MAX	 MAX	
0	
10	
20	
30	
40	
50	
60	
1	 4	 8	 12	 14	
Ba
nd
w
id
th
	(G
B/
s)
	
Number	of	threads	
Typer	
Tectorwise	
MAX	
Figure 30: Multi-core bandwidth utilization for
Typer and Tectorwise when running the large-sized
join query.
Yasin et al. [33] analyzes cloud workloads. Sridharan
and Patel [28] examines the evaluation of workloads on the
popular data analysis language R over a commodity pro-
cessor. Awan et al. [3, 4] presents a micro-architectural
analysis of Spark at the micro-architectural level. Kanev et.
al. [14] presents a profiling study of scale out workloads at
the micro-architectural level. Our work complements these
studies by presenting an analysis of modern analytical pro-
cessing systems.
12. CONCLUSIONS
In this work, we evaluate the micro-architectural behav-
ior of a breadth of OLAP systems from different categories
of systems and execution models. We examine CPU cycles
and memory bandwidth utilizations. The results show that,
unlike traditional, commercial OLTP systems, traditional,
commercial OLAP systems do not suffer from instruction
cache misses. Nevertheless, they suffer from their large in-
struction footprint making them orders of magnitude slower
than high performance OLAP engines.
High performance engines execute a tight instructions str-
eam; however, they spend 25 to 82% of the CPU cycles on
stalls regardless the workload being sequential- or random-
access-heavy. Sequential-access-heavy workloads stress the
memory bandwidth so high that hardware prefetchers fall
behind resulting in high data cache stalls. Random access
workloads suffer from long-latency data stalls consuming the
majority of CPU cycles. Lastly, high performance OLAP en-
gines underutilize the multi-core CPU or memory resources
due to their disproportional compute and memory demands,
showing that analytical processing engines should carefully
schedule their compute and memory resources for efficient
multi-core micro-architectural utilization.
13. REFERENCES
[1] D. Abadi, P. Boncz, and S. Harizopoulos. The Design
and Implementation of Modern Column-Oriented
Database Systems. Now Publishers Inc., 2013.
[2] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A.
Wood. DBMSs on a Modern Processor: Where Does
Time Go? VLDB, pages 266–277, 1999.
[3] A. J. Awan, M. Brorsson, V. Vlassov, and
E. Ayguade. Performance Characterization of
12
UN
DE
R
SU
BM
IS
SI
ON
In-Memory Data Analytics on a Modern Cloud Server.
BDCloud, pages 1–8, 2015.
[4] A. J. Awan, M. Brorsson, V. Vlassov, and
E. Ayguade. Micro-Architectural Characterization of
Apache Spark on Batch and Stream Processing
Workloads. BDCloud, pages 59–66, 2016.
[5] P. Boncz, T. Grust, M. van Keulen, S. Manegold,
J. Rittinger, and J. Teubner. MonetDB/XQuery: A
Fast XQuery Processor Powered by a Relational
Engine. SIGMOD, pages 479–490, 2006.
[6] M. Ferdman, A. Adileh, O. Kocberber, S. Volos,
M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu,
A. Ailamaki, and B. Falsafi. Clearing the Clouds: A
Study of Emerging Scale-out Workloads on Modern
Hardware. ASPLOS, pages 37–48, 2012.
[7] N. Hardavellas, I. Pandis, R. Johnson, N. Mancheril,
A. Ailamaki, and B. Falsafi. Database Servers on Chip
Multiprocessors: Limitations and Opportunities.
CIDR, pages 79–87, 2007.
[8] S. Idreos, F. Groffen, N. Nes, S. Manegold, K. S.
Mullender, and M. L. Kersten. MonetDB: Two
Decades of Research in Column-oriented Database
Architectures. IEEE Data Engineering Bulletin,
35(1):40–45, 2012.
[9] Intel. Disclosure of Hardware Prefetcher Control on
Some Intel Processors.
https://software.intel.com/en-us/articles/disclosure-
of-hw-prefetcher-control-on-some-intel-processors.
[10] Intel. Intel Memory Latency Checker.
https://software.intel.com/en-us/articles/intelr-
memory-latency-checker.
[11] Intel. Understanding How General Exploration Works
in Intel VTune Amplifier, 2018.
https://software.intel.com/en-
us/articles/understanding-how-general-exploration-
works-in-intel-vtune-amplifier-xe.
[12] Intel. Intel(R) 64 and IA-32 Architectures
Optimization Reference Manual, 2019.
[13] C. Jonathan, U. F. Minhas, J. Hunter, J. Levandoski,
and G. Nishanov. Exploiting Coroutines to Attack the
”Killer Nanoseconds”. Proc. VLDB Endow.,
11(11):1702–1714, July 2018.
[14] S. Kanev, J. P. Darago, K. Hazelwood,
P. Ranganathan, T. Moseley, G. Wei, and D. Brooks.
Profiling a warehouse-scale computer. ISCA, pages
158–169, 2015.
[15] M. Karpathiotakis, I. Alagiannis, and A. Ailamaki.
Fast Queries over Heterogeneous Data Through
Engine Customization. Proc. VLDB Endow.,
9(12):972–983, Aug. 2016.
[16] A. Kemper and T. Neumann. Hyper: A hybrid oltp
olap main memory database system based on virtual
memory snapshots. ICDE, pages 195–206, 2011.
[17] T. Kersten, V. Leis, A. Kemper, T. Neumann,
A. Pavlo, and P. Boncz. Everything You Always
Wanted to Know About Compiled and Vectorized
Queries but Were Afraid to Ask. Proc. VLDB Endow.,
11(13):2209–2222, Sept. 2018.
[18] T. Lahiri, S. Chavan, M. Colgan, D. Das, A. Ganesh,
M. Gleeson, S. Hase, A. Holloway, J. Kamp, T. Lee,
J. Loaiza, N. Macnaughton, V. Marwah,
N. Mukherjee, A. Mullick, S. Muthulingam, V. Raja,
M. Roth, E. Soylemez, and M. Zait. Oracle Database
In-Memory: A Dual Format In-memory Database.
ICDE, pages 1253–1258, 2015.
[19] P.-A. Larson, C. Clinciu, E. N. Hanson, A. Oks, S. L.
Price, S. Rangarajan, A. Surna, and Q. Zhou. SQL
Server Column Store Indexes. SIGMOD, pages
1177–1184, 2011.
[20] S. Manegold, P. A. Boncz, and M. L. Kersten.
Optimizing Main-Memory Join on Modern Hardware.
IEEE Trans. Knowl. Data Eng., 14(4):709–730, 2002.
[21] G. Psaropoulos, T. Legler, N. May, and A. Ailamaki.
Interleaving with Coroutines: A Practical Approach
for Robust Index Joins. PVLDB, 11(2):230–242, 2017.
[22] G. Psaropoulos, T. Legler, N. May, and A. Ailamaki.
Interleaving with Coroutines: A Systematic and
Practical Approach to Hide Memory Latency in Index
Joins. The VLDB Journal, Dec 2018.
[23] G. Psaropoulos, I. Oukid, T. Legler, N. May, and
A. Ailamaki. Bridging the Latency Gap between NVM
and DRAM for Latency-bound Operations. pages
13:1–13:8, 2019.
[24] V. Raman, G. Attaluri, R. Barber, N. Chainani,
D. Kalmuk, V. KulandaiSamy, J. Leenstra,
S. Lightstone, S. Liu, G. M. Lohman, T. Malkemus,
R. Mueller, I. Pandis, B. Schiefer, D. Sharpe, R. Sidle,
A. Storm, and L. Zhang. DB2 with BLU Acceleration:
So Much More Than Just a Column Store. Proc.
VLDB Endow., 6(11):1080–1091, Aug. 2013.
[25] U. Sirin, P. To¨zu¨n, D. Porobic, and A. Ailamaki.
Micro-architectural Analysis of In-memory OLTP.
SIGMOD, pages 387–402, 2016.
[26] U. Sirin, A. Yasin, and A. Ailamaki. A Methodology
for OLTP Micro-architectural Analysis. Damon, pages
1:1–1:10, 2017.
[27] J. Sompolski, M. Zukowski, and P. A. Boncz.
Vectorization vs. Compilation in Query Execution.
Damon, pages 33–40, 2011.
[28] S. Sridharan and J. M. Patel. Profiling R on a
Contemporary Processor. Proc. VLDB Endow.,
8(2):173–184, Oct. 2014.
[29] P. To¨zu¨n, B. Gold, and A. Ailamaki. OLTP in
wonderland: Where do cache misses come from in
major OLTP components? Damon, page 8, 2013.
[30] P. To¨zu¨n, I. Pandis, C. Kaynak, D. Jevdjic, and
A. Ailamaki. From A to E: Analyzing TPC’s OLTP
Benchmarks: The Obsolete, The Ubiquitous, The
Unexplored. EDBT, pages 17–28, 2013.
[31] TPC. Transcation Processing Performance Council.
http://www.tpc.org/.
[32] A. Yasin. A Top-Down Method for Performance
Analysis and Counters Architecture. ISPASS, pages
35–44, 2014.
[33] A. Yasin, Y. Ben-Asher, and A. Mendelson. Deep-dive
Analysis of The Data Analytics Workload in
CloudSuite. IISWC, pages 202–211, 2014.
13
