How to Stop Under-Utilization and Love Multicores by Ailamaki, Anastasia et al.
How to stop underutilization and love multicores
Anastasia Ailamaki  (EPFL) 
Erietta Liarou (EPFL) 
Pınar Tözün (EPFL) 
Danica Porobic (EPFL) 
Iraklis Psaroudakis (EPFL, SAP AG) 
i  multicores 
once upon a time … 
processor stalled >50% of the time
[VLDB99] 
i  multicores 
Moore’s law 
3 
doubling of transistor counts continues 
clock speeds and power hit the wall 
i  multicores 
processor trends 
4 
core core core core 











core core core core 
Core Core Core Core 
core core core core 
core Core Core Core 
core core core core 
core core core core 
i  multicores 




































now: cores & cache utilization 
6 
at peak throughput on Shore-MT, Intel Xeon X5660 
Maximum 
IPC < 1 on a 4-issue machine 

































i  multicores 
horizontal dimension: cores & sockets 
7 exploit abundant parallelism 


























i  multicores 


















number of threads 
access latency memory bandwidth 
i  multicores 
stopping underutilization 
• how to adapt traditional execution models to 
fully exploit modern hardware? 
 
• how to maximize data & instruction locality at 
the right level of the memory hierarchy? 
 
• how to continue scaling-up despite many 
cores and non-uniform topologies? 
 
utilization 
exploiting core’s resources 
minimizing memory stalls 
scalability 
scaling up OLTP 
















































subscalar CPUs
one instruction at a time
[figure: one CPU, clock cycles 1 to 12; each instruction goes through fetch, decode, execute, mem, write before the next one starts]
... ten cycles to complete 2 instructions!






Instruction pipelining:
multiple instructions can be partially overlapped
[figure: CPU timeline over clock cycles 1 to 12; consecutive instructions overlap in the fetch, decode, execute, mem, write stages]
fundamental way to parallelize
increase the utilization of on-die execution resources
increase the instruction throughput
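The cycle counts behind the subscalar and pipelining slides can be reproduced with a little arithmetic. This is a minimal sketch, assuming an idealized five-stage pipeline with one cycle per stage and no stalls or hazards; the function names are illustrative and not from the slides.

```python
STAGES = ["fetch", "decode", "execute", "mem", "write"]


def cycles_without_pipelining(n_instructions, n_stages=len(STAGES)):
    """Subscalar case: an instruction finishes all stages before the next starts."""
    return n_instructions * n_stages


def cycles_with_pipelining(n_instructions, n_stages=len(STAGES)):
    """Idealized pipeline: a new instruction enters the pipeline every cycle."""
    if n_instructions == 0:
        return 0
    # The first instruction needs n_stages cycles; each further one
    # completes exactly one cycle after its predecessor.
    return n_stages + n_instructions - 1


if __name__ == "__main__":
    for n in (2, 4, 100):
        print(n, cycles_without_pipelining(n), cycles_with_pipelining(n))
```

With two instructions the subscalar model gives the ten cycles quoted on the slide, while the idealized pipeline needs six.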
superscalar cpu
more than one instruction per clock cycle
[figure: several instructions (instr5, instr6, instr7, ...) occupy the fetch, decode, execute, mem, write stages during the same clock cycle]















SIMD (single instruction multiple data)
Apply an instruction to multiple data elements
• allows parallelism
• processing K elements at a time → speedup of up to K
processing large arrays of numeric values
e.g., the same value is being added to a large number of data points
































SISD to SIMD
A1   A2   A3   A4
op   op   op   op
B1   B2   B3   B4
R1   R2   R3   R4
apply the same action on multiple data values
with the same cost as for 1 value
1     5    10    3
add
2    21    1    2
3   26   11   5
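The four-lane addition above can be modeled in a few lines. This is only a sketch of the lane-wise semantics in plain Python, not real SIMD code; the function names and the default of four lanes per register are assumptions made for illustration.

```python
def simd_add(a, b):
    """Model one SIMD add: the same operation is applied to every lane."""
    if len(a) != len(b):
        raise ValueError("both registers must have the same number of lanes")
    return [x + y for x, y in zip(a, b)]


def scalar_instruction_count(n_elements):
    """SISD: one add instruction per element."""
    return n_elements


def simd_instruction_count(n_elements, lanes=4):
    """SIMD: one add instruction per group of `lanes` elements (ceiling division)."""
    return -(-n_elements // lanes)


if __name__ == "__main__":
    print(simd_add([1, 5, 10, 3], [2, 21, 1, 2]))
    print(scalar_instruction_count(1000), simd_instruction_count(1000))
```

The instruction counts show where the "speedup of K" comes from: with K lanes, an array of N values needs about N/K add instructions instead of N.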
SIMD (single instruction multiple data)








0,   4294967295,    0,     4294967295
SIMD (single instruction multiple data)
runs at 





3 5 10 3




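The value 4294967295 on the slide equals 2^32 - 1, a 32-bit lane with every bit set, which is how SIMD comparisons commonly report "true" for a lane. The sketch below models such a comparison in plain Python. The slide does not show which comparison produced its mask, so the second operand used here is hypothetical, and the function names are illustrative.

```python
ALL_ONES_32 = 0xFFFFFFFF  # 4294967295: every bit of a 32-bit lane is set


def simd_compare_equal(a, b):
    """Lane-wise equality: all ones where the lanes match, zero elsewhere."""
    return [ALL_ONES_32 if x == y else 0 for x, y in zip(a, b)]


def simd_and(mask, values):
    """Lane-wise AND: keeps the values of selected lanes and zeroes the rest."""
    return [m & v for m, v in zip(mask, values)]


if __name__ == "__main__":
    values = [3, 5, 10, 3]
    probe = [1, 5, 2, 3]  # hypothetical operand, chosen to reproduce the mask
    mask = simd_compare_equal(values, probe)
    print(mask)
    print(simd_and(mask, values))
```

Because the result is a bit mask rather than a branch outcome, a selection can be evaluated without conditional jumps.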

















































































SMT (simultaneous multithreading)
An SMT processor pretends to be multiple logical processors
if one thread stalls another one can continue
“30% performance gain” -- Intel










L2 competition for execution units
[VLDB05b]
i""mul&cores"





















reimplementation of dbms operators
share input and output data in the cache
[figure: two threads of one operator; one processes the even tuples, the other the odd tuples]













use one thread for the computation
preload data elements that will soon be needed
[figure: main worker thread and helper thread sharing the L2 cache]













An SMT processor pretends to be multiple logical processors
(one per instruction stream).
better than single threaded:
• increases thread-level parallelism
• improves processor utilization when one thread blocks
not as good as two physical cores























scan in multicores
[VLDB08b]
1 core for each query
[figure: cores c0, c1, c2, c3 with private L1 caches; each core scans the table for one query]

































1 core for each table scan
[figure: cores with private L1 caches; each core runs one scan that serves the query batch {Q1,...,Qn}]
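The difference between one scan per query and one scan per batch of queries can be sketched as follows. This is a simplified single-threaded illustration, not the algorithm of [VLDB08b]; the function names and the `rows_read` counter are assumptions introduced here to make the saved data traffic visible.

```python
def scan_per_query(table, predicates):
    """One full pass over the table for every query."""
    results, rows_read = [], 0
    for predicate in predicates:
        matches = []
        for row in table:
            rows_read += 1
            if predicate(row):
                matches.append(row)
        results.append(matches)
    return results, rows_read


def shared_scan(table, predicates):
    """One pass over the table; every row is tested against all queries."""
    results = [[] for _ in predicates]
    rows_read = 0
    for row in table:
        rows_read += 1
        for query_id, predicate in enumerate(predicates):
            if predicate(row):
                results[query_id].append(row)
    return results, rows_read


if __name__ == "__main__":
    table = list(range(10))
    queries = [lambda r: r < 3, lambda r: r % 2 == 0]
    print(scan_per_query(table, queries))
    print(shared_scan(table, queries))
```

Both variants return the same answers, but the shared scan reads each row once, so the data only has to travel through the cache hierarchy once per batch.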





























































































A0 or B0 























































































































































































































bitonic merge kernel with SIMD
[VLDB08a]
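A bitonic merge combines two sorted runs with a fixed sequence of compare-exchange steps, which is what makes it a good fit for SIMD min/max instructions. The sketch below shows the network in scalar Python. It illustrates the general technique and is not the kernel from [VLDB08a]; it assumes both inputs are sorted ascending and have the same power-of-two length.

```python
def bitonic_merge(a, b):
    """Merge two ascending runs of equal power-of-two length."""
    if len(a) != len(b) or len(a) == 0 or len(a) & (len(a) - 1):
        raise ValueError("inputs must have the same power-of-two length")
    # Ascending run followed by a descending run forms a bitonic sequence.
    seq = list(a) + list(reversed(b))
    n = len(seq)
    half = n // 2
    while half >= 1:
        # All compare-exchanges of one stage are independent of each other,
        # so a SIMD implementation can execute them as one min and one max.
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                low = min(seq[i], seq[i + half])
                high = max(seq[i], seq[i + half])
                seq[i], seq[i + half] = low, high
        half //= 2
    return seq


if __name__ == "__main__":
    print(bitonic_merge([1, 4, 6, 9], [2, 3, 7, 8]))
```

The sequence of comparisons does not depend on the data, so there are no unpredictable branches.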
i""mul&cores"























[figure: rows of SIMD merge kernels]


































sorting on multicore SIMD
[VLDB08a]
core
2 cores work simultaneously 









































i  multicores 
50 
utilization 
exploiting core’s resources 
minimizing memory stalls 
scalability 
scaling up OLTP 


























L3 / LLC 
L1-I L1-D 
today’s memory hierarchy 
51 





stalls → wasted power & $$$$






















































CloudSuite on Intel Xeon X5670 
[ASPLOS12] 















graph courtesy of Ferdman et al. 
~1 instruction per cycle
sources of memory stalls 

















































































i  multicores 
• 50%-80% of cycles are stalls 
– Problem: 
instruction fetch & long-latency data misses 
– Instructions need more capacity 
– Data misses are compulsory 
 
• Focus on maximizing: 
– L1-I locality & cache line utilization for data 
 
for data intensive applications … 
minimizing memory stalls
being cache conscious 
code optimizations 
alternative data structures/layout 
vectorized execution 
 









i  multicores 
prefetching – lite 
• next-line: miss A  fetch A+1 
• stream: miss A, A+1  fetch A+2, A+3 
 
favors sequential access & spatial locality 
 instructions: branches, function calls 
• branch prediction 
 data: pointer chasing 
• stride: miss A, A+20  fetch A+40, A+60 
 
56 
… or text-book prefetching 
[ISCA90, MICRO00] 
though, memory stalls are still too high 
preferred on real hardware due to simplicity 
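The three textbook prefetchers can be written down as tiny address predictors. This is a minimal sketch over abstract cache-block numbers. The function names and the `depth` parameter are illustrative, and a hardware stride prefetcher would usually confirm a stride over several misses before acting on it.

```python
def next_line(miss):
    """next-line: a miss on block A triggers a fetch of A+1."""
    return [miss + 1]


def stream(previous_miss, miss, depth=2):
    """stream: misses on A and A+1 trigger fetches of A+2, A+3."""
    if miss == previous_miss + 1:
        return [miss + i for i in range(1, depth + 1)]
    return []  # the two misses are not sequential, so no stream is detected


def stride(previous_miss, miss, depth=2):
    """stride: misses on A and A+20 trigger fetches of A+40, A+60."""
    step = miss - previous_miss
    if step == 0:
        return []
    return [miss + i * step for i in range(1, depth + 1)]


if __name__ == "__main__":
    a = 100
    print(next_line(a), stream(a, a + 1), stride(a, a + 20))
```

None of the three predictors can follow a pointer chain or a jump to an unrelated function, which is why the slide lists those as the cases where simple prefetching falls short.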













 . . . cache 
accesses 
time 




slide courtesy of Cansu Kaynak 














 . . . cache 
accesses 
time 
 . . . X Y Z A C D C  . . . X Y Z A C D C 
58 
[ISCA05, MICRO13a] 
high space cost 
exploits recurring control flow 
slide courtesy of Cansu Kaynak 









only for data on real hardware 
i  multicores 
minimizing memory stalls 
60 
being cache conscious 
code optimizations 
alternative data structures/layout 
vectorized execution 
 









i  multicores 
code optimizations 
• simplified code 
– in-memory databases have smaller instruction footprint 
 
• better code layout 
– minimize jumps → exploit next-line prefetcher
– profile-guided optimizations (static) 
– just-in-time (dynamic) 
 
• query compilation into machine/native code
– e.g., HyPer, Hekaton, MemSQL 
61 
[ISCA01, ICDE10, PVLDB11a, SIGMOD13a] 
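The idea behind query compilation can be illustrated by generating a specialized function for one predicate instead of interpreting a generic expression tree for every row. The sketch below generates Python source rather than machine code, so it only shows the principle. `compile_filter` and its parameters are hypothetical and do not reflect how HyPer, Hekaton, or MemSQL are implemented.

```python
ALLOWED_OPERATORS = {"<", "<=", "==", ">=", ">"}


def compile_filter(column, operator, constant):
    """Generate a scan function specialized for `column operator constant`."""
    if operator not in ALLOWED_OPERATORS:
        raise ValueError("unsupported operator: %r" % (operator,))
    # The column name, operator, and constant are baked into the generated
    # code, so the loop body contains no per-row interpretation overhead.
    source = (
        "def scan(rows):\n"
        f"    return [r for r in rows if r[{column!r}] {operator} {constant!r}]\n"
    )
    namespace = {}
    exec(source, namespace)
    return namespace["scan"]


if __name__ == "__main__":
    rows = [{"qty": 5}, {"qty": 12}, {"qty": 30}]
    scan = compile_filter("qty", ">", 10)
    print(scan(rows))
```

The generated loop is short and straight-line, which fits the slide's goals of a smaller instruction footprint and fewer jumps.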
i  multicores 
cache conscious data layouts 
62 
16 bytes columns 
cache lines (64bytes) 
 
goal: 
maximize cache line utilization & 
exploit next-line prefetcher 
 
row stores: good for OLTP 
accessing many columns 
column stores: good for OLAP 
accessing a few columns 







row store: erietta blue | pinar black
column store: erietta pinar danica iraklis
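Cache line utilization for the two layouts can be estimated with simple arithmetic, using the 16-byte columns and 64-byte cache lines from the slide. This is a back-of-the-envelope sketch for a scan that reads a single column. It assumes rows are aligned and occupy whole cache lines, and the number of columns per row is a parameter chosen for illustration.

```python
CACHE_LINE_BYTES = 64
COLUMN_BYTES = 16


def scan_one_column(n_rows, columns_per_row, layout):
    """Return (cache lines touched, fraction of the fetched bytes that is useful)."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    if layout == "column":
        values_per_line = CACHE_LINE_BYTES // COLUMN_BYTES
        lines = -(-n_rows // values_per_line)  # ceiling division
    elif layout == "row":
        row_bytes = columns_per_row * COLUMN_BYTES
        if row_bytes % CACHE_LINE_BYTES != 0:
            raise ValueError("sketch assumes rows that fill whole cache lines")
        # The wanted 16 bytes of each row sit inside exactly one cache line.
        lines = n_rows
    else:
        raise ValueError("layout must be 'row' or 'column'")
    useful_bytes = n_rows * COLUMN_BYTES
    return lines, useful_bytes / (lines * CACHE_LINE_BYTES)


if __name__ == "__main__":
    print(scan_one_column(1000, 4, "row"))
    print(scan_one_column(1000, 4, "column"))
```

A query that needs all columns of a few rows shows the opposite picture, which is why the slide recommends row stores for OLTP and column stores for OLAP.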
i  multicores 
cache conscious data structures 
63 
in memory index tree 
exploit next-line prefetcher in tree probe 




+ align nodes to cache lines 
i  multicores 

















. . . . . . 







 allows exploiting SIMD 


















. . . . . . 
i  multicores 
minimizing memory stalls 
66 
being cache conscious 
code optimizations 
alternative data structures/layout 
vectorized execution 
 









i  multicores 
instruction & data overlap 












TPC-C (100GB data) on Shore-MT 
overlapping cache blocks cold hot 
higher overlap in same-type transactions
overlap: significant for instructions & low for data 


































need to track recent misses and cache contents 
exploits aggregate L1-I & instruction overlap 
i  multicores 
summary 
• DBMSs underutilize a core’s resources 
• Problem 1: L1-I misses 
– due to capacity 
– minimized footprint & 
illusion of a larger cache by maximizing re-use 
• Problem 2: LLC data misses 
– compulsory 
– maximize cache-line utilization through 
cache-conscious algorithms and layout 
 
69 
i  multicores 
70 
utilization 
exploiting core’s resources 
minimizing memory stalls 
scalability 
scaling up OLTP 





















i  multicores 


















number of threads 
access latency memory bandwidth 
i  multicores 
73 
critical path of transaction execution 




many accesses to shared data structures 
i  multicores 



















unpredictable data accesses 
clutter code with critical sections -> contention 
[PVLDB10b] 
i  multicores 
critical sections 
75 
Updating 1 row 





























i  multicores 
critical section types
[VLDBJ14]
unbounded: locking, latching
fixed: transaction manager
cooperative: logging
unbounded → fixed / cooperative
i  multicores 










i  multicores 
78 
hot shared locks cause contention 
lock manager 
trx1 trx2 trx3 
agent thread execution 
hot lock 
cold lock 
release and request the same locks repeatedly 
i  multicores 
79 
lock manager 
trx1 trx2 trx3 
agent thread execution 
hot lock 
cold lock 
speculative lock inheritance 
commit without  
releasing hot locks 
seed lock list  
of next trx 
[VLDB09b] 
significantly reduces lock contention 
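The effect of speculative lock inheritance on lock manager traffic can be shown with a toy model that only counts requests. This is a heavily simplified sketch and not the protocol from [VLDB09b]: it models a single agent thread, assumes the hot locks are compatible shared-mode locks, and ignores the invalidation needed when another transaction requests a conflicting mode. All names are illustrative.

```python
class CountingLockManager:
    """Stand-in for the central lock manager; it only counts requests."""

    def __init__(self):
        self.requests = 0

    def acquire(self, lock):
        self.requests += 1

    def release(self, lock):
        self.requests += 1


def run_agent(lock_manager, transactions, hot_locks, inherit):
    """Run transactions back to back on one agent thread."""
    inherited = set()
    for needed in transactions:
        owned = set(inherited)  # seed the lock list of the next transaction
        for lock in needed:
            if lock not in owned:
                lock_manager.acquire(lock)
                owned.add(lock)
        inherited = set()
        for lock in owned:  # commit
            if inherit and lock in hot_locks:
                inherited.add(lock)  # commit without releasing the hot lock
            else:
                lock_manager.release(lock)
    for lock in inherited:  # the agent finally gives up what it still holds
        lock_manager.release(lock)
    return lock_manager.requests


if __name__ == "__main__":
    work = [["db.IS", "row1"], ["db.IS", "row2"], ["db.IS", "row3"]]
    print(run_agent(CountingLockManager(), work, {"db.IS"}, inherit=False))
    print(run_agent(CountingLockManager(), work, {"db.IS"}, inherit=True))
```

Every request that is avoided here corresponds to one fewer critical section on the hot lock's queue in the lock manager.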
i  multicores 
lightweight intent locks 
80 
• hottest locks in the system are intent locks 
 
• few intent locks -> high contention 
 
• lightweight intent locks: 
– counters in data pages 
– updated atomically 
– lower overhead than SLI 
[ADMS12] 
i  multicores 
data-oriented transaction execution 
81 
[PVLDB10b] 
convert centralized locking to thread-local 







[figure: local lock table with columns Pref, LM, Own, Wait; example entries {1,0} EX A and {1,3} EX B]




i  multicores 




















i  multicores 


















predictable data accesses
[PVLDB10b] 
i  multicores 
modern shared-nothing systems 
• physical data partitioning 
• single threaded execution: no locking or latching 
• main-memory optimized: no buffer pool 
• support persistence on disk 
• durability through replication or logical logging 
 
• main challenge: concurrency with multi-site and 
long running transactions 
84 
[ICDE14c] 
[VLDB07b, ICDE11, SIGMOD12] 
i  multicores 
modern shared-nothing systems 
• H-Store/VoltDB 
– extreme fine-grained shared-nothing 
– speculative optimistic concurrency control 
• HyPer 
– OLAP support through VM snapshots 
– strict timestamp ordering 
– tentative execution for long running transactions 
– implicit locking with hardware transactional memory 
• Calvin 
– deterministic execution model with conflict detection 










i  multicores 
multiversion concurrency control 
86 
• scalable serializable snapshot isolation 
– latch-free validation phase using atomic ops 
• distributed snapshot isolation in SAP HANA 
– snapshot tokens, local-only transactions and write buffering 
 
• Hekaton 
– OCC with parallel validation and commit dependency tracking 
• Silo 





i  multicores 










i  multicores 
data access in centralized B-tree 
heap 
index 
conflicts on both index and heap pages
i  multicores 
range worker 
A – M  
N – Z  
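Assigning key ranges to workers means that every key is owned by exactly one thread, so accesses to it need no coordination with other threads. The sketch below routes keys to per-worker queues by their first letter, mirroring the A – M / N – Z split on the slide. `RangeRouter` and its methods are hypothetical names, and the sketch leaves out how a worker executes its queue.

```python
import bisect
from collections import deque


class RangeRouter:
    """Route each key to the single worker that owns its key range."""

    def __init__(self, upper_bounds):
        # upper_bounds[i] is the last first-letter served by worker i.
        self.upper_bounds = sorted(upper_bounds)
        self.queues = [deque() for _ in self.upper_bounds]

    def worker_for(self, key):
        index = bisect.bisect_left(self.upper_bounds, key[:1].upper())
        if index == len(self.upper_bounds):
            raise KeyError("no worker owns key %r" % (key,))
        return index

    def submit(self, key, action):
        worker = self.worker_for(key)
        self.queues[worker].append((key, action))
        return worker


if __name__ == "__main__":
    router = RangeRouter(["M", "Z"])
    for name in ("alice", "mike", "nina", "zoe"):
        print(name, router.submit(name, "update"))
```

Because only one worker ever touches a given range, the state that protects it can live in that worker's thread-local structures instead of a shared one.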








i  multicores 







• bulk synchronous parallel processing model 
• point-to-point synchronization 
• software-prefetching and SIMD 
[PVLDB11c] 
figure courtesy of Jason Sewall 







• latch-free log-structured B-tree 
• optimized for both main memory and flash 




figure courtesy of Justin Levandoski 
i  multicores 










i  multicores 
WAL: gatekeeper of the DBMS 
• write ahead logging is a performance enabler 








logging is completely serial – by design 
i  multicores 










A: serialize at the log head
B: I/O delay to harden the commit record
C: serialize on incompatible lock
i  multicores 
 
• early lock release 
– can be improved further with controlled lock violation
 
• flush pipelining 
– reduces context switches 
 
• consolidation array 
– minimize log contention 





















i  multicores 










i  multicores 
other unbounded communication 
• critical sections protect log buffer, stats, lock 







synchronization required for one index probe 
diverse use cases – how to select the best primitive? 
i  multicores 
lock-based approaches
blocking OS mutex
  pro: simple to use; con: overhead, unscalable
reader-writer lock
  pro: concurrent readers; con: overhead
queue-based spinlock (“MCS”)
  pro: scalable; con: memory management
test and set spinlock (TAS)
  pro: efficient; con: unscalable
lock-free approaches
optimistic concurrency control (OCC)
  pro: low read overhead; con: writes cause livelock
atomic updates
  pro: efficient; con: limited applicability
lock-free algorithms
  pro: scalable; con: special-purpose algos
hardware transactional memory
  pro: efficient, scalable; con: not widely available
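The trade-off listed for optimistic concurrency control can be seen in a version-validated read: the reader writes nothing to shared memory, but it has to retry whenever a writer interferes. The sketch below is a single-threaded model of that protocol. A real implementation needs atomic loads and stores with memory fences, and writers are assumed to be serialized by some other mechanism; the names are illustrative.

```python
class VersionedValue:
    """A value guarded by a version counter (even = stable, odd = being written)."""

    def __init__(self, value):
        self.version = 0
        self.value = value

    def write(self, value):
        self.version += 1  # odd: a write is in progress
        self.value = value
        self.version += 1  # even again: the new value is published


def optimistic_read(cell, max_retries=1000):
    """Read without taking a lock; retry if a write overlapped the read."""
    for _ in range(max_retries):
        before = cell.version
        if before % 2 == 1:
            continue  # a writer is active, try again
        value = cell.value
        if cell.version == before:
            return value  # validation succeeded: no write in between
    raise RuntimeError("gave up: too many conflicting writes")


if __name__ == "__main__":
    cell = VersionedValue("old")
    print(optimistic_read(cell))
    cell.write("new")
    print(optimistic_read(cell), cell.version)
```

The retry loop is where the livelock risk comes from: under a steady stream of writes a reader may never see a stable version.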
i  multicores 
synchronization “cheat sheet” 
 OS blocking mutex: only for scheduling 
 reader-writer lock: dominated by OCC/MCS 












i  multicores 












































socket 0 socket 1 
communication latencies vary by order-of-magnitude 
i  multicores 
OLTP on Hardware Islands
shared-everything: stable, not optimal
shared-nothing: fast, sensitive to workload
Island shared-nothing: robust middle ground
 
• challenges 
– optimal configuration depends on workload and hardware 





i  multicores 







Probe A Probe B 
i  multicores 
• identify bottlenecks in existing systems 
– eliminate bottlenecks systematically and holistically 
• design new system from the ground up 
– without creating new bottlenecks 
• do not assume uniformity in communication 
• choose the right synchronization mechanism 
105 
scaling up OLTP 
i  multicores 
106 
utilization 
exploiting core’s resources 
minimizing memory stalls 
scalability 
scaling up OLTP 
































Scaling up OLAP 
107 





OLAP is also concerned with resource saturation
sharing across queries mitigates saturation
i  multicores 
















numerous points to consider for NUMA-awareness 
(3) remote access 
latency (1.5x local) 
figure courtesy of Blagodurov et al. 
i  multicores 











i  multicores 
sharing is caring… 
110 




i  multicores 
reactive sharing
• query-centric
• shares common sub-plans

proactive sharing
• global query plan with shared operators
• shared scans










how and when should we use each technique?
[SIGMOD14b] 
i  multicores 
[VLDB07a, PVLDB13b] 
























by pulling shared intermediate results 







A columns 11 
SELECT * FROM Α, Β 
WHERE Α.c1 = Β.c1  
AND  AND 
SELECT * FROM Α, Β 
WHERE Α.c1 = Β.c1  
AND  AND 
σ σ 
01 B columns 
A columns B columns 01 
+ bitwise AND 
shared operators can support high throughput 
[VLDB09a] 
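A shared operator can serve several queries at once by attaching a bitmap to every tuple, with one bit per query, and combining the bitmaps with a bitwise AND at the join. The sketch below illustrates that idea with plain dictionaries. It is not the operator from [VLDB09a]; the function names and the example predicates are hypothetical.

```python
def shared_select(rows, predicates):
    """Tag each row with a bitmap: bit q is set if the row qualifies for query q."""
    tagged = []
    for row in rows:
        bits = 0
        for query_id, predicate in enumerate(predicates):
            if predicate(row):
                bits |= 1 << query_id
        if bits:  # rows that no query wants are dropped early
            tagged.append((row, bits))
    return tagged


def shared_hash_join(left, right, key):
    """Join once for all queries; a result belongs to the queries in both bitmaps."""
    table = {}
    for row, bits in left:
        table.setdefault(row[key], []).append((row, bits))
    joined = []
    for right_row, right_bits in right:
        for left_row, left_bits in table.get(right_row[key], []):
            both = left_bits & right_bits  # bitwise AND of the two bitmaps
            if both:
                joined.append((left_row, right_row, both))
    return joined


if __name__ == "__main__":
    a = [{"c1": 1, "x": 5}, {"c1": 2, "x": 50}]
    b = [{"c1": 1, "y": 7}, {"c1": 2, "y": 70}]
    tagged_a = shared_select(a, [lambda r: r["x"] < 10, lambda r: True])
    tagged_b = shared_select(b, [lambda r: True, lambda r: r["y"] > 10])
    for left_row, right_row, bits in shared_hash_join(tagged_a, tagged_b, "c1"):
        print(left_row, right_row, format(bits, "02b"))
```

The hash table is built and probed once regardless of the number of queries; only the bitmap width grows with concurrency.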
i  multicores 
[figure: the shared join plan of the two queries again, with every tuple bitmap equal to 11]
+ bitwise AND
bits are always the same
[PVLDB13b]
i  multicores 
11 11 
11 







SELECT * FROM Α, Β 
WHERE Α.c1 = Β.c1  
AND  AND 
SELECT * FROM Α, Β 
WHERE Α.c1 = Β.c1  
AND  AND 
σ σ 
B columns 
A columns B columns 




reactive sharing can improve proactive sharing 
[PVLDB13b] 
i  multicores 













reactive proactive (global query plan) 
execution dynamic dynamic dynamic batched 














i  multicores 
share responsibly 
117 
demo on wed 15:00  
& thu 10:30 
[PVLDB13b, SIGMOD14b] 
when to share: how to share
low concurrency: query-centric operators + reactive sharing
high concurrency: proactive sharing + reactive sharing
i  multicores 











i  multicores 
application-agnostic NUMA-awareness 
• black box approach 
– monitoring to predict behavior 
• DINO scheduler 
– moves threads and their data to balance cache load 
• Carrefour 
– re-organizes data to avoid memory bottlenecks 
– by: replicating, interleaving or co-locating data 
119 
[HPCA13, USENIX11, ASPLOS13] 
not always optimal for DBMS 
i  multicores 
impact of NUMA 
• data partitions accessed by different clients  
– co-locate threads and data they access 
120 
[BTW13] 
up to 75% improvement 
figure courtesy of Kiefer et al. 
i  multicores 
data shuffling  
• N threads, each partitions its local data into N 
equally-sized pieces, transmitted to the rest 
• naïve method: 
121 
[CIDR13b] 
saturates memory and interconnects 
s1.p1 s1.p2 s2.p1 s2.p2 s3.p1 s3.p2 s4.p1 s4.p2 
s1.c1 s1.c2 s2.c1 s2.c2 s3.c1 s3.c2 s4.c1 s4.c2 
step 1 
s1.p1 s1.p2 s2.p1 s2.p2 s3.p1 s3.p2 s4.p1 s4.p2 




i  multicores 
coordinated shuffling 
122 
inner ring fixed 
outer ring rotates 
balances memory and interconnect traffic 
[CIDR13b] 
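The ring picture can be turned into a transfer schedule: in step k, thread i sends its piece to thread (i + k) mod N, so every destination receives from exactly one source per step. The sketch below contrasts this with one plausible reading of the naive method, in which all threads work through the destinations in the same order. That reading is an assumption, since the slide only shows the naive method as a figure, and the function names are illustrative.

```python
def naive_schedule(n_threads):
    """Assumed naive order: in step k every thread sends to destination k."""
    return [[(src, step) for src in range(n_threads)]
            for step in range(n_threads)]


def ring_schedule(n_threads):
    """Coordinated order: in step k thread i sends to (i + k) mod N."""
    return [[(src, (src + step) % n_threads) for src in range(n_threads)]
            for step in range(n_threads)]


def busiest_destination(step_transfers):
    """Largest number of transfers that hit one destination in one step."""
    counts = {}
    for _, dst in step_transfers:
        counts[dst] = counts.get(dst, 0) + 1
    return max(counts.values())


if __name__ == "__main__":
    for name, schedule in (("naive", naive_schedule(4)), ("ring", ring_schedule(4))):
        print(name, [busiest_destination(step) for step in schedule])
```

Both schedules move the same N x N pieces; the ring order only changes when each transfer happens, which spreads the load over all memory controllers and links in every step.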
i  multicores 
radix hash join 
123 
[VLDB09c] 
cache-efficient but not NUMA-aware 
partitions (by key) are small 
enough to fit into cache 
figure courtesy of Kim et al. 
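The partitioning idea behind a radix hash join can be sketched briefly: both inputs are split by the low bits of the join key, and each pair of small partitions is then joined with its own hash table. This is a single-threaded illustration of the general technique and not the implementation from [VLDB09c]; it assumes non-negative integer keys, and the function names and the `bits` parameter are illustrative.

```python
def radix_partition(rows, key, bits):
    """Split rows into 2**bits partitions by the low bits of the join key."""
    mask = (1 << bits) - 1
    partitions = [[] for _ in range(1 << bits)]
    for row in rows:
        partitions[row[key] & mask].append(row)
    return partitions


def radix_hash_join(left, right, key, bits=2):
    """Join partition by partition so each hash table stays small."""
    result = []
    left_parts = radix_partition(left, key, bits)
    right_parts = radix_partition(right, key, bits)
    for left_part, right_part in zip(left_parts, right_parts):
        table = {}
        for left_row in left_part:  # build on one small partition
            table.setdefault(left_row[key], []).append(left_row)
        for right_row in right_part:  # probe with the matching partition
            for left_row in table.get(right_row[key], []):
                result.append((left_row, right_row))
    return result


if __name__ == "__main__":
    left = [(i, "l%d" % i) for i in range(8)]
    right = [(i % 4, "r%d" % i) for i in range(6)]
    print(sorted(radix_hash_join(left, right, key=0)))
```

Equal keys always land in the same partition, so no matches are lost, and the number of bits is chosen so that one partition's hash table fits into the cache.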
i  multicores 
• NUMA-awareness rules: 
– no remote random writes 
– sequential remote reads 
– no synchronization 
massively parallel sort-merge join 
124 
[PVLDB12a] 
remote random accesses > remote scans 
faster than radix hash-join 
for star schemas 
figure courtesy of Albutiu et al. 
i  multicores 
























suffers from bandwidth 
saturation for general schemas 
multi-way merging with 
task scheduling to balance 
CPU and memory 
radix hash join 
still superior 
a long-standing battle 
i  multicores 











i  multicores 


















Context switch Cache thrashing Overutilization 
i  multicores 










socket 1 socket 2 task queues 
[ADMS13] 
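One way to organize per-socket task queues is sketched below: a worker first takes tasks queued for its own socket, where the data is assumed to be local, and only looks at other sockets when its own queue is empty. Whether and how the scheduler in [ADMS13] steals across sockets is not stated on the slide, so the stealing policy and all names here are assumptions.

```python
from collections import deque


class SocketTaskQueues:
    """One task queue per socket, with a simple fallback to other sockets."""

    def __init__(self, n_sockets):
        self.queues = [deque() for _ in range(n_sockets)]

    def submit(self, task, socket):
        """Queue a task on the socket whose memory holds its data."""
        self.queues[socket].append(task)

    def next_task(self, worker_socket):
        """Prefer local tasks; otherwise steal from another socket."""
        local = self.queues[worker_socket]
        if local:
            return local.popleft()
        for socket, queue in enumerate(self.queues):
            if socket != worker_socket and queue:
                return queue.pop()  # steal from the tail of a remote queue
        return None  # nothing left to do anywhere


if __name__ == "__main__":
    scheduler = SocketTaskQueues(2)
    scheduler.submit("scan part 0", socket=0)
    scheduler.submit("scan part 1", socket=0)
    print(scheduler.next_task(0), scheduler.next_task(1), scheduler.next_task(1))
```

Keeping the number of workers fixed and moving tasks instead of threads is what avoids the context switching and overutilization named on the slide.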










Context switch Cache thrashing Overutilization 
i  multicores 
opportunities and challenges 
129 
[ADMS13, DSAA14, ICDE13a, PCS13] 
challenges solutions 







task granularity depending on saturation 
opportunities advantages 
decouple from OS full control and predictability 
task granularity 
balance CPU and memory 
parallelism 
task prioritization workload management 
i  multicores 
task scheduling for OLAP 
130 
[SIGMOD14a] 
figure courtesy of Leis et al. 
i  multicores 
embrace… 
• sharing 
– reduces contention for resources 
– reactive and proactive 
• NUMA-awareness 
– reduce latency and avoid bottlenecks 
– data placement and thread scheduling 
– black box approach not optimal 
– algorithms 
• task scheduling 
– abstract resources and utilize them efficiently 
…to scale up OLAP
i  multicores 
132 
utilization 
exploiting core’s resources 
minimizing memory stalls 
scalability 
scaling up OLTP 









i  multicores 
exploiting hardware requires 
– utilizing the resources of a core 
– taking advantage of parallelism 
– optimally managing the memory 
art of scheduling 
– adjust your task granularity 
– optimize locality at the right level 
– avoid saturation 
road to scalability 
– eliminate all unbounded communication 
133 
concluding remarks 
bridge the gap between software & hardware 
i  multicores 














 Transistor Scaling (Moore's Law)
 Supply Voltage (ITRS)
age of dark silicon is upon us! 
exponential increase in unusable area on chips 
[MICRO11, USENIX12] 
graph courtesy of Hardavellas et al. 
i  multicores 
[ISCA14] 
exploiting dark silicon 
toward specialized hardware
• Meet the walkers 
• Database processing unit 
• Programmable 
accelerators 
• Bionic databases 
• Reconfigurable datacenters 






i  multicores 
open questions – How to … 
• fit NVRAM to memory hierarchy? 
• exploit HTM? 
• adapt the whole software stack (OS + 
applications) to hardware specialization? 
• take advantage of compilers? 
• design concurrency-control for many-cores? 








i  multicores 
references 
[ADMS12] H. Kimura, G. Graefe, and H. Kuno: Efficient Locking Techniques for Databases on 
Modern Hardware. 
[ADMS13] I. Psaroudakis, T. Scheuer, N. May, and A. Ailamaki: Task Scheduling for Highly 
Concurrent Analytical and Transactional Main-Memory Workloads. 
[ASPLOS06] K. Chakraborty, P. M. Wells, and G. S. Sohi: Computation spreading: employing
hardware migration to specialize CMP cores on-the-fly.
[ASPLOS12] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A.
D. Popescu, A. Ailamaki, and B. Falsafi: Clearing the Clouds: A Study of Emerging Scale-out
Workloads on Modern Hardware.
[ASPLOS13] M. Dashti, A. Fedorova, J. Funston, F. Gaud, R. Lachaize, B. Lepers, V. Quema, and M.
Roth: Traffic management: A holistic approach to memory placement on NUMA systems.
[ASPLOS14] L. Wu, A. Lottarini, T. K. Paine, M. A. Kim, and K. A. Ross: Q100: The Architecture and 
Design of a Database Processing Unit. 
[BTW13] T. Kiefer, B. Schlegel, and W. Lehner: Experimental Evaluation of NUMA Effects on 
Database Management Systems 
[CIDR05] P. A. Boncz, M. Zukowski, and N. Nes: MonetDB/X100: Hyper-Pipelining Query 
Execution. 
[CIDR13a] R. Johnson and I. Pandis: The bionic DBMS is coming, but what will it look like? 
 
[CIDR13b] Y. Li, I. Pandis, R. Mueller, V. Raman, and G. Lohman: NUMA-aware Algorithms: The
Case of Data Shuffling.
[CIDR13c] H. Mühe, A. Kemper, and T. Neumann: Executing Long-Running Transactions in 
Synchronization-Free Main Memory Database Systems. 
[DaMoN13] P. Tözün, B. Gold, and A. Ailamaki: OLTP in Wonderland -- Where do cache misses 
come from in major OLTP components? 
[DaMoN14] I. Psaroudakis, T. Kissinger, D. Porobic, T. Ilsche, E. Liarou, P. Tözün, A. Ailamaki, and 
W. Lehner: Dynamic Fine-Grained Scheduling for Energy-Efficient Main-Memory Queries 
[DSAA14] J. Wust, M. Grund, K. Hoewelmeyer, D. Schwalb, and H. Plattner: Concurrent Execution 
of Mixed Enterprise Workloads on In-Memory Databases. 
[EDBT13] P. Tözün, I. Pandis, C. Kaynak, D. Jevdjic, and A. Ailamaki: From A to E: Analyzing TPC’s 
OLTP Benchmarks – The obsolete, the ubiquitous, the unexplored. 
[Eurosys12] Y. Mao, E. Kohler, and R. Morris: Cache Craftiness for Fast Multicore Key-Value 
Storage. 
[Eurosys14] Z. Wang, H. Qian, J. Li, and H. Chen: Using Restricted Transactional Memory to Build 
a Scalable In-Memory Database. 
[HPCA13] L. Tang, J. Mars, X. Zhang, R. Hagmann, R. Hundt, and E. Tune: Optimizing Google's 
warehouse scale computers: The NUMA experience. 
[ICDE10] K. Krikellas, S. D. Viglas, M. Cintra: Generating code for holistic query evaluation. 
[ICDE11] A. Kemper and T. Neumann: HyPer – a hybrid OLTP&OLAP main memory database 
system based on virtual memory snapshots. 
[ICDE13a] J. Dees and P. Sanders: Efficient many-core query execution in main memory column-
stores 
[ICDE13b] J. Lee, Y. Kwon, F. Farber, M. Muehle, C. Lee, C. Bensberg, J. Lee, A. Lee, and W. 
Lehner: SAP HANA Distributed In-Memory Database System: Transaction, Session and Metadata 
Management 
[ICDE13c] J. Levandoski, D. Lomet, and S. Sengupta: The Bw-Tree: A B-tree for new hardware 
platforms. 
[ICDE14a] H. Han, S. Park, H. Jung, A. Fekete, U. Roehm, and H. Yeom : Scalable Serializable 
Snapshot Isolation for Multicore Systems. 
[ICDE14b] V. Leis, A. Kemper, and T. Neumann: Exploiting Hardware Transactional Memory in 
Main-Memory Databases. 
[ICDE14c] N. Malviya, A. Weisberg, S. Madden, and M. Stonebraker: Rethinking Main Memory 
OLTP Recovery. 
[ICDE14d] D. Porobic, E. Liarou, P. Tözün, and A. Ailamaki: ATraPos: Adaptive Transaction 
Processing on Hardware Islands. 
[IMDM13] S. Wolf, H. Mühe, A. Kemper, and T. Neumann: An evaluation of strict timestamp 
ordering concurrency control for main-memory database systems. 
[ISCA90] N. P. Jouppi: Improving Direct-mapped Cache Performance by the Addition of a Small 
Fully-associative Cache and Prefetch Buffers. 
[ISCA01] A. Ramirez, L. A. Barroso, K. Gharachorloo, R. Cohn, J. Larriba-Pey, P. G. Lowney, and M. 
Valero: Code Layout Optimizations for Transaction Processing Workloads.  
[ISCA05] T. F. Wenisch, S. Somogyi, N. Hardavellas, J. Kim, A. Ailamaki, and B. Falsafi: Temporal 
Memory Streaming of Shared Memory. 
[ISCA14] A. Putnam, A. Caulfield, E. Chung, D. Chiou, K. Constantinides, J. Demme, H. 
Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray, M. Haselman, S. Hauck, S. Heil, A. Hormati, J. Kim, S. 
Lanka, J. Larus, E. Peterson, S. Pope, A. Smith, J. Thong, P. Y. Xiao, and D. Burger: A 
Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services. 
[MICRO00] T. Sherwood, S. Sair, and B. Calder: Predictor-directed Stream Buffers. 
[MICRO11] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki: Toward Dark Silicon in 
Servers. 
[MICRO12] I. Atta, P. Tözün, A. Ailamaki, and A. Moshovos: SLICC: Self-Assembly of Instruction 




i  multicores 
references 
[MICRO13a] C. Kaynak, B. Grot, and B. Falsaﬁ: SHIFT: Shared History Instruction Fetch for Lean-
Core Server Processors. 
[MICRO13b] O. Kocberber, B. Grot, J. Picorel, B. Falsaﬁ, K. Lim, and P. Ranganathan: Meet the 
Walkers: Accelerating Index Traversals for In-memory Databases.  
[MITCMU14] X. Yu, G. Bezerra, A. Pavlo, S. Devadas, and M. Stonebraker: Staring into the Abyss: 
An Evaluation of  Concurrency Control with One Thousand Cores. 
[ORACLE] https://labs.oracle.com/pls/apex/f?p=labs:49:::::P49_PROJECT_ID:14 
[PCS13] B. Vikranth, R. Wankar, and C. Rao: Topology Aware Task Stealing for On-chip NUMA 
Multi-core Processors. 
[PVLDB10a] R. Johnson, I. Pandis, R. Stoica, M. Athanassoulis, and A. Ailamaki: Aether: A Scalable 
Approach to Logging. 
[PVLDB10b] I. Pandis, R. Johnson, N. Hardavellas, and A. Ailamaki: Data-Oriented Transaction 
Execution. 
[PVLDB11a] T. Neumann: Efficiently compiling efficient query plans for modern hardware. 
[PVLDB11b] I. Pandis, P. Tözün, R. Johnson, and A. Ailamaki: PLP: page latch-free shared-
everything OLTP. 
[PVLDB11c] J. Sewall, J. Chhugani, C. Kim, N. Satish, and P. Dubey: PALM: Parallel architecture-
friendly latch-free modifications to B+ trees on many-core processors. 
[PVLDB12a] M.-C. Albutiu, A. Kemper, and T. Neumann: Massively Parallel Sort-merge Joins in 
Main Memory Multi-core Database Systems. 
[PVLDB12b] G. Giannikis, G. Alonso, and D. Kossmann: SharedDB: Killing One Thousand Queries 
with One Stone. 
[PVLDB12c] P. Larson, S. Blanas, C. Diaconu, C. Freedman, J. M. Patel, and M. Zwilling: High-
performance concurrency control mechanisms for main-memory databases. 
[PVLDB12d] D. Porobic, I. Pandis, M. Branco, P. Tözün, and A. Ailamaki: OLTP on Hardware 
Islands. 
[PVLDB13a] J. Levandoski, D. Lomet, and S. Sengupta: LLAMA: A Cache/Storage Subsystem for 
Modern Hardware 
[PVLDB13b] I. Psaroudakis, M. Athanassoulis, and A. Ailamaki: Sharing Data and Work Across 
Concurrent Analytical Queries. 
[PVLDB13c] K. Ren, A. Thomson, and D. J. Abadi: Lightweight locking for main memory database 
systems. 
[PVLDB14a] C. Balkesen, G. Alonso, J. Teubner, and M. T. Ozsu: Multi-Core, Main-Memory Joins: 
Sort vs. Hash Revisited. 




i  multicores 
references 
[PVLDB14c] M. Karpathiotakis, M. Branco, I. Alagiannis, and A. Ailamaki: Adaptive Query 
Processing on RAW Data. 
[PVLDB14d] Y. Klonatos, C. Koch, T. Rompf, and H. Chafi: Building Efficient Query Engines in a 
High-Level Language. 
[PVLDB14e] S. Pelley, T. Wenisch, B. Gold, and B. Bridge: Storage Management in the NVRAM 
Era. 
[PVLDB14f] T. Wang and  R. Johnson: Scalable Logging through Emerging Non-Volatile Memory. 
[SIGMOD85] G. P. Copeland and S. N. Khoshafian: A Decomposition Storage Model. 
[SIGMOD02a] S. Chen, P. B. Gibbons, T. C. Mowry, and G. Valentin: Fractal prefetching B+-Trees: 
optimizing both cache and disk performance. 
[SIGMOD02b] J. Zhou and K. Ross: Implementing Database Operations Using SIMD Instructions. 
[SIGMOD05] S. Harizopoulos, V. Shkapenyuk, and A. Ailamaki: QPipe: a simultaneously pipelined 
relational query engine. 
[SIGMOD10a] S. Arumugam, A. Dobra, C. Jermaine, N. Pansare, and L. Perez: The DataPath 
system: a data-centric analytic processing engine for large data warehouses. 
[SIGMOD10b] E. P. Jones, D. J. Abadi, and S. Madden: Low overhead concurrency control for 
partitioned main memory databases. 
 
[SIGMOD12] A. Thomson, T. Diamond, S.-C. Weng, K. Ren, P. Shao, and D. J. Abadi: Calvin: Fast 
distributed transactions for partitioned database systems.  
[SIGMOD13a] C. Diaconu, C. Freedman, E. Ismert, P. Larson, P. Mittal, R. Stonecipher, N. Verma, 
and M. Zwilling: Hekaton: SQL Server’s memory-optimized OLTP engine. 
[SIGMOD13b] G. Graefe, M. Lilibridge, H. Kuno, J. Tucek, and A. Veitch: Controlled Lock Violation. 
[SIGMOD14a] V. Leis, P. Boncz, A. Kemper, T. Neumann: Morsel-Driven Parallelism: A NUMA-
aware Query Evaluation Framework for the Many-Core Age. 
[SIGMOD14b] I. Psaroudakis, M. Athanassoulis, M. Olma, and A. Ailamaki: Reactive and Proactive 
Sharing Across Concurrent Analytical Queries. 
[SOSP13] S. Tu, W. Zheng, E. Kohler, B. Liskov, and S. Madden: Speedy transactions in multicore 
in-memory databases. 
[USENIX11] S. Blagodurov, S. Zhuravlev, M. Dashti, and A. Fedorova: A Case for NUMA-aware 
Contention Management on Multicore Systems. 
[USENIX12] N. Hardavellas: The Rise and Fall of Dark Silicon. 
[VLDB05a] M. Stonebraker, D. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, 
S. Madden, E. O'Neil, P. O'Neil, A. Rasin, N. Tran, and S. Zdonik: C-Store: A Column Oriented 
DBMS. 
[VLDB05b] J. Zhou, J. Cieslewicz, K. Ross, and M. Shah: Improving Database Performance on 
Simultaneous Multithreading Processors. 
[VLDB06] A. Ghoting, G. Buehrer, S. Parthasarathy, D. Kim, A. Nguyen, Y.-K. Chen, and P. Dubey: 
Cache-conscious frequent pattern mining on modern and emerging processors. 
[VLDB07a] R. Johnson, S. Harizopoulos, N. Hardavellas, K. Sabirli, I. Pandis, A. Ailamaki, N. G. 
Mancheril, and B. Falsafi: To share or not to share? 
[VLDB07b] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland: 
The end of an architectural era: (it’s time for a complete rewrite). 
[VLDB08a] J. Chhugani, A. Nguyen, V. Lee, W. Macy, M. Hagog, Y. Chen, A. Baransi, S. Kumar, and 
P. Dubey: Efficient implementation of sorting on multi-core SIMD CPU architecture. 
[VLDB08b] Lin Qiao, Vijayshankar Raman, Frederick Reiss, Peter J. Haas, and Guy M. Lohman. 
Main-Memory Scan Sharing For Multi-Core CPUs 
[VLDB09a] G. Candea, N. Polyzotis, and R. Vingralek: A scalable, predictable join operator for 
highly concurrent data warehouses. 
[VLDB09b] R. Johnson, I. Pandis, and A. Ailamaki: Improving OLTP Scalability Using Speculative 
Lock Inheritance. 
[VLDB09c] C. Kim, T. Kaldewey, V. W. Lee, E. Sedlar, A. D. Nguyen, N. Satish, J. Chhugani, A. D. 
Blas, and P. Dubey: Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core 
CPUs. 
[VLDB09d] R. Mueller, J. Teubner, and G. Alonso: Data Processing on FPGAs. 
[VLDBJ14] R. Johnson, I. Pandis, and A. Ailamaki: Eliminating unscalable communication in 
transaction processing. 
 
