Optimizing Database Architecture for the New Bottleneck: Memory Access by Manegold, S. (Stefan) et al.
The VLDB Journal (2000) 9: 231–246
Optimizing database architecture for the new bottleneck:
memory access
Stefan Manegold1, Peter A. Boncz2,∗, Martin L. Kersten1
1 CWI, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands; E-mail: {S.Manegold,M.L.Kersten}@cwi.nl
2 Data Distilleries B.V., Kruislaan 402, 1098 SMAmsterdam, The Netherlands; E-mail: P.Boncz@ddi.nl
Edited by M.P. Atkinson. Received April 20, 2000 / Accepted June 23, 2000
Abstract. In the past decade, advances in the speed of com-
modity CPUs have far out-paced advances in memory la-
tency. Main-memory access is therefore increasingly a per-
formance bottleneck for many computer applications, includ-
ing database systems. In this article, we use a simple scan
test to show the severe impact of this bottleneck. The insights
gained are translated into guidelines for database architecture,
in terms of both data structures and algorithms. We discuss
how vertically fragmented data structures optimize cache per-
formance on sequential data access. We then focus on equi-
join, typically a random-access operation, and introduce radix
algorithms for partitioned hash-join. The performance of these
algorithms is quantified using a detailed analytical model that
incorporates memory access cost. Experiments that validate
this model were performed on the Monet database system.
We obtained exact statistics on events such as TLBmisses and
L1 and L2 cachemisses by using hardware performance coun-
ters found in modern CPUs. Using our cost model, we show
how the carefully tuned memory access pattern of our radix
algorithms makes them perform well, which is confirmed by
experimental results.
Key words: Main-memory databases – Query processing –
Memory access optimization – Decomposed storage model –
Join algorithms – Implementation techniques
1 Introduction
Custom hardware – from workstations to PCs – has expe-
rienced tremendous improvements in the past decades. Un-
fortunately, this growth has not been equally distributed over
all aspects of hardware performance and capacity. Figure 1
shows that the speed of commercial microprocessors has in-
creased roughly 70% every year, while the speed of commod-
ity DRAM has improved by little more than 50% over the past
decade [Mow94]. Part of the reason for this is that there is a
A preliminary version of this paper has been published as [BMK99].
∗ This work was carried out while the author was at the University
of Amsterdam, supported by SION grant 612-23-431.
Fig. 1. Hardware trends in DRAM and CPU speed
direct tradeoff between capacity and speed in DRAM chips,
and the highest priority has been for increasing capacity. The
result is that from the perspective of the processor, memory
has been getting slower at a dramatic rate. This affects all com-
puter systems, making it increasingly difficult to achieve high
processor efficiencies.
Three aspects of memory performance are of interest:
bandwidth, latency, and address translation. The only way to
reduce effective memory latency for applications has been to
incorporate cache memories in the memory subsystem. Fast
and more expensive SRAM memory chips found their way
to computer boards, to be used as L2 caches. Due to the
ever-rising CPU clock-speeds, the time to bridge the physi-
cal distance between such chips and the CPU became a prob-
lem; so modern CPUs come with an on-chip L1 cache (see
Fig. 2). This physical distance is actually amajor complication
for designs trying to reduce main-memory latency. The new
DRAM standards Rambus [Ram96] and SLDRAM [SLD97]
therefore concentrate on fixing the memory bandwidth bottle-
neck [McC95], rather than the latency problem.
Cache memories can reduce the memory latency only
when the requested data is found in the cache. Their effective-
ness depends on the memory access pattern of the application.
Thus, unless special care is taken,memory latency becomes an
increasing performance bottleneck, preventing applications –
232 S. Manegold et al.: Optimizing database architecture for the new bottleneck: memory access
Fig. 2. Hierarchical memory system
including database systems – from fully exploiting the power
of modern hardware.
Besides memory latency and memory bandwidth, trans-
lation of logical virtual memory addresses to physical page
addresses can also have a severe impact on memory access
performance. The memory management unit (MMU) of all
modern CPUs has a translation lookaside buffer (TLB), a kind
of cache that holds the translation for (typically) the 64 most
recently used pages. If a logical address is found in the TLB,
the translation has no additional cost. Otherwise, a TLB miss
occurs. A TLB miss is handled by trapping to a routine in
the operating system kernel, which translates the address and
places it in the TLB. Depending on the implementation and
hardware architecture, TLB misses can be even more costly
than a main memory access.
For a more detailed discussion of the hardware back-
ground, we refer the interested reader to [MBK00].
1.1 Overview
In this article we investigate the effect of memory access cost
on database performance by looking in detail at the main-
memory cost of typical database applications. Our research
group has studied large main-memory database systems for
the past 10 years. This research started in the PRISMA project
[AvdBF+92], focusingonmassive parallelism, and is nowcen-
tered around Monet [BQK96,BWK98], a high-performance
system targeted to query-intensive application areas like
OLAP and Data Mining. For the research presented here, we
use Monet as our experimentation platform.
The rest of this paper is organized as follows. In Sect. 2, we
analyze the impact of memory access costs on basic database
operations. We show that, unless special care is taken, a data-
base server running even a simple sequential scan on a table
will spend 95% of its cycles waiting for memory to be ac-
cessed. We present a detailed analytical cost model that de-
scribes how hardware characteristics like cache line sizes and
cache miss latencies determine the performance of a sequen-
tial scan. Thismemory access bottleneck is evenmore difficult
to avoid in more complex database operations such as sorting,
aggregation and join, which exhibit a random access pattern.
In Sect. 3, we discuss the consequences of this bottleneck
for data structures and algorithms to be used in database sys-
tems. We identify vertical fragmentation as the solution for
database data structures that leads to optimal memory cache
usage. Concerning query processing algorithms, we focus on
equi-join and introduce new radix algorithms for partitioned
hash-join.
In Sect. 4, we analyze the properties of these algorithms
with detailed analytical cost models that quantify query costs
in terms of CPU cycles, TLB misses, and cache misses. These
models show howmemory access determines the performance
of database algorithms, and they enable us to tune the mem-
ory access pattern of our algorithms carefully to achieve op-
timal performance. Exhaustive experiments using the Monet
system confirm the significant performance improvement that
our memory-conscious algorithms achieve over standard al-
gorithms.
Finally,we evaluate our findings and conclude that the hard
data obtained in our experiments justify the basic architectural
choices of theMonet system, which back in 1992 were mostly
based on intuition.
2 Initial experiment
In this section, we demonstrate the severe impact of mem-
ory access cost on the performance of elementary database
operations. Figure 3 shows results of a simple scan test on
a number of popular workstations of the past decade. In this
test, we sequentially scan an in-memory buffer, by iteratively
reading one byte with a varying stride, i.e., the offset between
two subsequently accessed memory addresses. This experi-
ment mimics what happens if a database server performs a
read-only scan of a one-byte column in an in-memory table
with a certain record-width (the stride); as would happen in
a selection on a column with zero selectivity or in a simple
aggregation (e.g., Max or Sum). The y-axis in Fig. 3 shows
the cost of 200000 iterations in elapsed time, and the x-axis
shows the stride used. We made sure that the buffer was in
memory, but not in any of the memory caches.
2.1 General observations
When the stride is small, successive iterations in the scan read
bytes that are near to each other in memory, hitting the same
cache line. The number of L1 and L2 cache misses is therefore
low. The L1 miss rate reaches its maximum of one miss per
iteration as soon as the stride reaches the size of an L1 cache
line (16 to 32 bytes). Only the L2 miss rate increases further,
until the stride exceeds the size of an L2 cache line (16 to
128 bytes). Then, it is certain that every memory read is a
cache miss. Performance cannot become any worse and stays
constant.
In the following, we first present a detailed analysis of our
experiment in order to understand the impact of various parts
of the hardware system on the performance of (basic) database
operations, such as a sequential in-memory scan. We use an
SGI Origin2000 with MIPS R10000 processors (250 MHz)
as a sample machine, but we keep the models applicable to
other systems as well by using a set of specific parameters to
describe the respective hardware characteristics. Table 1 lists
the parameters for the Origin2000. In [MBK00], we present a
calibration tool to automatically extract these parameters on
any computer hardware. The software is freely available for
S. Manegold et al.: Optimizing database architecture for the new bottleneck: memory access 233
Fig. 3. Reality check: simple in-memory
scan of 200000 tuples
Table 1. Hardware characteristics
Description Value
Machine type SGI Origin2000
OS IRIX64 6.5
CPU MIPS R10000
CPU speed 250 MHz
CPU-inherent parallelism q 4
Main-memory size 48 GB (4 GB local)
L1 cache size ||L1|| 32 KB
L1 cache line size LSL1 32 bytes
L1 cache lines |L1|L1 1024
L2 cache size ||L2|| 4 MB
L2 cache line size LSL2 128 bytes
L2 lines |L2|L2 32768
TLB entries |TLB| 64
Page size ||Pg|| 16 KB
TLB size (|TLB| · ||Pg||) ||TLB|| 1 MB
L1 miss latency lL2 24 ns = 6 cycles
L2 miss latency lMem 400 ns = 100 cycles
TLB miss latency lTLB 228 ns = 57 cycles
Memory bandwidth bwMem 555 MB/s
download from http://www.cwi.nl/∼monet/. There, the cali-
brated results for a large number of hardware platforms are
available, too.
After the detailed analysis, we discuss our scan experiment
in a broader context.
2.2 Detailed analysis
In general, the execution costs per iteration of our experiment
– depending on the stride s – can be modeled in terms of
pure CPU costs (including data accesses in the on-chip L1
cache) and additional costs due toL2 cache accesses andmain-
memory accesses.
Tomeasure the pure CPU costs – i.e., without anymemory
access costs – we reduce the problem size to fit in the L1 cache
and ensure that the table is cached in L1 before running the
experiment. This way, we observed TCPU = 24 ns (6 cycles)
per iteration for our experiment.
We model the costs for accessing data in the L2 cache and
in main memory by scoring each access with the respective la-
tency.As observed above, the number of L2 andmainmemory
accesses (i.e., the number of L1 and L2 misses, respectively)
depends on the access stride. With a stride s smaller than the
cache line sizeLS, the average number of cache misses per it-
eration isM(s) = sLS .With a stride equal to or larger than the
cache line size, a miss occurs with each iteration. In general,
we get (i ∈ {1, 2})
MLi(s) =
{ s
LSLi
, if s < LSLi
1, if s ≥ LSLi
= min
{
s
LSLi
, 1
}
,
withMLi andLSLi (i ∈ {1, 2}) denoting the number of cache
misses and the cache line sizes for each level, respectively.
Figure 4 comparesML1 andML2 to the measured number of
cache misses.
We get the total costs per iteration – depending on the
access stride – by summing the CPU costs, the L2 access
costs, and the main-memory access costs:
T (s) = TCPU + TL2(s) + TMem(s), (“model 1”)
with
TL2(s) = ML1(s) · lL2,
TMem(s) = ML2(s) · lMem,
where lx (x ∈ {L2,Mem}) denote the (cache) memory ac-
cess latencies for each level, respectively. We measure the L2
234 S. Manegold et al.: Optimizing database architecture for the new bottleneck: memory access
1e-03
1e-02
1e-01
1e+00
1e+01
1 2 4 8 16 32 64 128 256 512
e
ve
n
ts
 p
er
 it
er
at
io
n
record width (stride) [byte]
L1 misses
L2 misses
Fig. 4.Measured (points) and modeled (lines) cache misses
and memory latency with our hardware calibration tool pre-
sented in [MBK00].1 Figure 5 shows the resulting curve as
“model 1.”
Obviously, this model does not match the experimental re-
sults. The reason is, that the R10000 processor is super-scalar
and can handle up to q = 4 active operations concurrently.
Thus, the impact of memory access latency on the overall exe-
cution time may be reduced as (a) there must be q unresolved
memory requests before the CPU stalls, and (b) up to q L1 or
L2 cache lines may be loaded in parallel. In other words, op-
erations may (partly) overlap. Consequently, their costs must
not simply be added. Instead, we combine two cost compo-
nents x and y – given the degree o ∈ [0 . . . 1] they overlap –
using the following function:
O(o, x, y) = max{x, y}+ (1− o) ·min{x, y}
= x+ y − o ·min{x, y}.
This overlap function forms a linear interpolation between the
two extreme cases
• no overlap (o = 0) =⇒ O(0, x, y) = x+ y,
• full overlap (o = 1) =⇒ O(1, x, y) = max{x, y}.
Let o1 and o2 be the degrees of overlap for L2 access and
main-memory access, respectively. Then, we get the total cost
considering overlap by applying the overlap function twice:
T ′ = O(o1, TCPU, TL2),
T = O(o2, T ′, TMem).
The following consideration will help us to determine o1 and
o2. In our experiments, we have a pure sequential memory
access pattern. Up to a stride of 8 bytes, 4 subsequent memory
references refer to the same 32-bytes L1 line, i.e., only one
L1 line is loaded at a time, not allowing any overlap of pure
calculation and memory access (o1 = o2 = 0). With strides
between 8 and 32 bytes, o1 linearly increases towards its max-
imum. The same holds for o2 with strides between 32 and 128
bytes, as L2 lines contain 128 bytes on the R10000. Thus, we
get (i ∈ {1, 2})
oi(s) =


0, if s ≤ LSLiq
s− LSLiq
LSLi − LSLiq
, if
LSLi
q
< s < LSLi
1, if s ≥ LSLi .
1e+01
1e+02
1e+03
1 2 4 8 16 32 64 128 256 512
n
a
n
o
se
co
n
ds
 p
er
 it
er
at
io
n
record width (stride) [byte]
experiment
model 1
model 2
model 3
ideal
Fig. 5. Sequential scan: experiment and models
However, we have to consider another limit: bandwidth.
L2 bandwidth is large enough to allow q = 4 concurrent L1
loads within a single L2 latency period, reducing the effective
latency to (1/q)th. Memory bandwidth, however, is limited to
555 MB/s (see [MBK99]). Hence, at least lmin = 220 ns (55
cycles) are needed to load one L2 line (128 bytes).
Now, we can refine our model as follows:
T ′(s) = O
(
o1(s), TCPU, TL2(s)
)
T (s) = O
(
o2(s), T ′(s), TMem(s)
) (“model 2”)
with
TL2(s) = ML1(s)·
(
1− o1(s)
ML1(s)
)
·lL2
TMem(s) = ML2(s)·
(
1− o2(s)
ML2(s)
·(1−lmin)
)
·lMem.
and TCPU, ML1, ML2, lL2, lMem as in “model 1". Figure 5
depicts the resulting curve as “model 2.” This fits the experi-
mental curve pretty well. The differences for small strides can
be eliminated by setting o1 = 1 for all strides, as “model3”
shows in the figure.This in turnmeans that theCPU itself loads
several L1 lines concurrently, even if 4 subsequent memory
references refer to the sameL1 line. In this scenario, the “ideal”
performance of
T (s) = max{TCPU, TL2(s), TMem(s)}, (“ideal”)
i.e., with o1 = o2 = 1, is not reached (see “ideal” in
Fig. 5), because the whole memory bandwidth cannot be uti-
lized automatically for smaller strides, i.e., when severalmem-
ory references refer to a single L2 line.
2.3 Discussion
The detailed analysis and the models derived show how hard-
ware specific parameters such as cache line sizes, cache miss
penalties, and degree of CPU-inherent parallelism determine
the performance of our scan experiment. We will now discuss
the experiment in a broader context.
While all machines in Fig. 3 exhibit the same pattern of
performance degradation with decreasing data locality, Fig.
1 See also http://www.cwi.nl/∼monet/.
S. Manegold et al.: Optimizing database architecture for the new bottleneck: memory access 235
3 clearly shows that the penalty for poor memory cache us-
age has dramatically increased in the last six years. The CPU
speed has improved by almost an order of magnitude, both
through higher clock frequencies and through increased CPU-
inherent parallelism. However, the memory access latencies
have hardly changed. In fact, we must draw the sad conclu-
sion that if no attention is paid in query processing to data
locality all advances in CPU power are neutralized due to the
memory access bottleneck caused by memory latency. The
considerable growth of memory bandwidth – reflected in the
growing cache line sizes 2 – does not solve the latency problem
if data locality is low.
This trend of improvement in bandwidth but standstill in
latency [Ram96,SLD97] is expected to continue, with no real
solutions in sight. The work in [Mow94] has proposed to hide
memory latency behindCPUwork by issuing prefetch instruc-
tions, before data is accessed. The effectiveness of this tech-
nique for database applications is, however, limited due to the
fact that the amount of CPU work per memory access tends
to be small in database operations (e.g., the CPU work in our
select-experiment requires only 4 cycles on the Origin2000).
Another proposal [MKW+98] has been to make the caching
system of a computer configurable, allowing the programmer
to give a “cache-hint” by specifying the memory-access stride
that is going to be used on a region. Only the specified data
would then be fetched, hence optimizing bandwidth usage.
Such a proposal has not yet been considered for custom hard-
ware, however, let alone in OS and compiler tools that would
need to provide the possibility to incorporate such hints for
user programs.
3 Architectural consequences
In the previous sections we have shown that it is less and less
appropriate to think of themainmemory of a computer system
as “random access” memory. In this section, we analyze the
consequences for both data structures and algorithms used in
database systems.
3.1 Data structures
The default physical tuple representation is a consecutive byte
sequence, which must always be accessed by the bottom op-
erators in a query evaluation tree (typically selections or pro-
jections). In the case of sequential scan, we have seen that
performance is strongly determined by the record width (the
position on the x-axis of Fig. 3). This width quickly becomes
too large, and hence performance decreases (e.g., an Item tu-
ple, as shown in Fig. 6, occupies at least 80 bytes on relational
systems). To achieve better performance, a smaller stride is
needed, and for this purpose we recommend using vertically
decomposed data structures.
Monet [BK95,BK99] uses the Decomposed Storage
Model [CK85], storing each column of a relational table in a
separate binary table, called a binary association table (BAT).
2 In one memory fetch, the Origin2000 gets 128 bytes, whereas
the Sun LX gets only 16; an improvement of factor 8.
Fig. 6.Vertically decomposed storage in BATs
A BAT is represented in memory as an array of fixed-size two-
field records [OID,value], or binary units (BUN). Their width
is typically 8 bytes.
In the case of the Origin2000 machine, we deduce from
Fig. 3 that a scan-selection on a table with stride 8 takes 10
CPU cycles per iteration, whereas with a stride of 1 it takes
only 4 cycles. In other words, in a simple range-select, there
is so little CPU work per tuple (4 cycles) that the memory
access cost for a stride of 8 still weighs quite heavily (6 cy-
cles). Therefore we have found it useful inMonet to apply two
space optimizations that further reduce the per-tuple memory
requirements in BATs:
Virtual OIDs: Generally, when decomposing a relational ta-
ble, we get an identical system-generated column of OIDs
in all decomposition BATs, which is dense and ascending
(e.g., 1000, 1001, . . . , 1007). In such BATs, Monet com-
putes the OID values on the fly when they are accessed
using positional lookup of the BUN and avoids allocat-
ing the 4-byte OID field. This is called a “virtual OID” or
VOID column.Apart from reducingmemory requirements
by half, this optimization is also beneficial when joins or
semi-joins are performed on OID columns.3 When one of
the join columns is aVOID, Monet uses positional lookup
instead of, e.g., hash-lookup, effectively eliminating all
join cost.
Byte encodings: Database columns often have a low domain
cardinality. For such columns, Monet uses fixed-size en-
codings in 1- or 2-byte integer values. This simple tech-
nique was chosen because it does not require decoding ef-
fort when the values are used (e.g., a selection on a string
“MAIL” can be re-mapped to a selection on a byte with
value 3). A more complex scheme (e.g., using bit com-
pression) might yield even more memory savings, but the
decoding step required whenever values are accessed can
3 Theprojectionphase in query processing typically leads inMonet
to additional “tuple-reconstruction” joins onOID columns, which are
caused by the fact that tuples are decomposed into multiple BATs.
236 S. Manegold et al.: Optimizing database architecture for the new bottleneck: memory access
03
92
81
47
17
57
03
17
57
81
92
H
H different cluster buffers
1-Pass Cluster
Fig. 7. Straightforward clustering algorithm
quickly become counterproductive due to extra CPU ef-
fort. Even if decoding would only cost a handful of cycles
for each tuple, this would more than double the amount
of CPU effort in simple database operations, such as the
range-select from our experiment.
Figure 6 shows that when applying both techniques; the
storage needed for 1 BUN in the “shipmode” column is re-
duced from 8 bytes to just 1.
3.2 Query processing algorithms
We now briefly discuss the effect of the memory access bottle-
neck on the design of algorithms for commonquery processing
operators.
Selections: If the selectivity is low, most data needs to be vis-
ited, and this is best done with a scan-select (it has optimal
data locality). For higher selectivities, Lehman and Carey
[LC86] concluded that theT-tree and bucket-chained hash-
table were the best data structures for accelerating selec-
tions in main-memory databases. The work in [Ron98]
reports, however, that a B-tree with a block-size equal to
the cache line size is optimal. Our findings about the in-
creased impact of cache misses indeed support this claim,
since lookup using a hash-table or T-tree causes random
memory access to the entire relation, a non-cache-friendly
access pattern.
Grouping and aggregation: Two algorithms are often used
here: sort/merge and hash-grouping. In sort/merge, the ta-
ble is first sorted on the group-by attribute(s) followed by
scanning. Hash-grouping scans the relation once, keeping
a temporary hash-table where the group-by values are a
key that give access to the aggregate totals. This number of
groups is often limited, such that this hash-table fits the L2
cache, and probably also the L1 cache. This makes hash-
grouping superior to sort/merge concerningmain-memory
access, as the sort step has random access behavior and
is performed on the entire relation to be grouped, which
probably does not fit any cache.
Equi-joins: Hash-join has long been the preferred main-
memory join algorithm. It first builds a hash-table on the
(001)
(001)
(011)
(111)
(100)
(001)
(100)
(110)
(000)
(101)
(010)
(001)
57
17
47
03
92
81
20
06
96
37
66
75
1
0
1
(000)
(001)
(001)
(001)
(001)
(010)
(011)
(100)
(100)
(101)
(110)
(111)
96
57
17
81
75
66
03
92
20
37
06
47
1
0
1
0
0
00
01
(001)
(001)
(001)
(000)
(001)
(011)
(010)
(100)
(100)
(101)
(111)
(110)
57
17
81
96
75
03
66
92
20
37
47
06
11
10
2
2
2
2
Pass 1 (2 bits)
4
Pass 2 (1 bit)
Fig. 8. 2-pass/3-bit Radix-cluster (lower bits given in parentheses)
smaller relation (the inner relation). The outer relation is
then scanned, and for each tuple a hash-lookup is per-
formed to find the matching tuples. If this inner relation
plus the hash-table does not fit in any memory cache, a
performance problem occurs, due to the random access
pattern. Merge-join is not a viable alternative as it requires
sorting on both relations first, which would cause random
access over even a larger memory region.
Consequently, we identify join as the most problematic
operator; therefore we investigate possible alternatives that
can get optimal performance out of a hierarchical memory
system.
3.3 Clustered hash-join
Shatdahl et al. [SKN94] showed that a main-memory vari-
ant of Grace Join, in which both relations are first partitioned
on hash-number into H separate clusters, which each fit the
memory cache, performs better than a normal bucket-chained
hash-join. This work employs a straightforward clustering al-
gorithm that simply scans the relation to be clustered once,
inserting each scanned tuple in one of the clusters, as depicted
in Fig. 7. This constitutes a random access pattern that writes
into H separate locations. If H exceeds the number of avail-
able cache lines (L1 or L2), cache trashing occurs; alterna-
tively, ifH exceeds the number of TLB entries, the number of
TLB misses will explode. Both factors will severely degrade
overall join performance.
As an improvement of this straightforward algorithm, we
propose a clustering algorithm that has a cache-friendly mem-
ory access pattern, even for high values of H .
Radix algorithms The radix-cluster algorithm splits a rela-
tion intoH clusters using multiple passes (see Fig. 8). Radix-
clustering on the lowerB bits of the integer hash-value of a col-
umn is done inP sequential passes, inwhich each pass clusters
tuples on Bp bits, starting with the leftmost bits (
∑P
1 Bp =
B). The number of clusters created by the radix-cluster is
H =
∏P
1 Hp, where each pass subdivides each cluster into
Hp = 2Bp newones.When the algorithmstarts, the entire rela-
tion is considered as one cluster and is subdivided inH1 = 2B1
S. Manegold et al.: Optimizing database architecture for the new bottleneck: memory access 237
96
57
17
81
75
66
03
20
06
47
17
(000)
(001)
(001)
(001)
(001)
(010)
(011)
(100)
(100)
(101)
(110)
(111)
32
35
20
47
(000)
(001)
(010)
(100)
(111)
(011)
(000)
96
03 (011)
66
(010)
(010)10
2
92
37
R
L
Fig. 9. Joining 3-bit radix-clustered inputs (black tuples hit)
clusters. The next pass takes these clusters and subdivides each
into H2 = 2B2 new ones, yielding H1 · H2 clusters in total,
etc. Note that with P = 1, the radix-cluster behaves like the
straightforward algorithm.
The interesting property of the radix-cluster is that the
number of randomly accessed regions Hx can be kept low;
while still a high overall number ofH clusters can be achieved
using multiple passes. More specifically, if we keep Hx =
2Bx smaller than the number of cache lines, we avoid cache
trashing altogether.
After radix-clustering a column on B bits, all tuples that
have the same B lowest bits in its column hash-value ap-
pear consecutively in the relation, typically forming chunks
of C/2B tuples. It is therefore not strictly necessary to store
the cluster boundaries in some additional data structure; an al-
gorithm scanning a radix-clustered relation can determine the
cluster boundaries by looking at these lower B “radix bits.”
This allows very fine clusterings without introducing over-
head by large boundary structures. It is interesting to note that
a radix-clustered relation is in fact ordered on radix bits (see
parentheses next to the right-most column in Fig. 8). When
using the algorithm in the partitioned hash-join, we exploit
this property, by performing a merge step on the radix bits of
both radix-clustered relations to obtain the pairs of clusters
that should be hash-joined with each other (see Figs. 9 and
10).
The alternative radix-join algorithm, also proposed here,
makes use of the very fine clustering capabilities of radix-
cluster. If the number of clustersH is high, the radix-clustering
brings the potentially matching tuples near to each other. As
chunk sizes are small, a simple nested loop is then sufficient
to filter out the matching tuples (see Fig. 10). Radix-join is
similar to hash-join in the sense that the number H should
be tuned to be the relation cardinality C divided by a small
constant; just like the length of the bucket-chain in a hash-
table. If this constant gets down to 1, radix-join degenerates
to sort/merge-join, with radix-sort [Knu68] employed in the
sorting phase.
4 Quantitative assessment
The radix-cluster algorithm presented in the previous section
provides three tuning parameters:
partitioned-hashjoin(L, R, H):
radix-cluster(L,H)
radix-cluster(R,H)
FOREACH cluster IN [1 . . . H]
hash-join(L[c], R[c])
radix-join(L, R, H):
radix-cluster(L,H)
radix-cluster(R,H)
FOREACH cluster IN [1 . . . H]
nested-loop(L[c], R[c])
Fig. 10. Join algorithms employed
1. the number of bits used for clustering (B), implying the
number of clusters H = 2B ,
2. the number of passes used during clustering (P ),
3. the number of bits used per clustering pass (Bp).
In the following, we present an exhaustive series of exper-
iments to analyze the performance impact of different settings
of these parameters. After establishing which parameters’ set-
tings are optimal for radix-clustering a relation on B bits, we
turn our attention to the performance of the join algorithms
with varying values of B. Finally, these two experiments are
combined to gain insight in overall join performance.
4.1 Experimental setup
In our experiments, we use binary relations (BATs) of 8-bytes-
wide tuples and varying cardinalities, consisting of uniformly
distributed unique random numbers. In the join experiments,
the join hit rate is one, and the result of a join is a BAT that
contains the [OID,OID] combinations ofmatching tuples (i.e.,
a join-index [Val87]). Subsequent tuple reconstruction is cheap
in Monet, and equal for all algorithms, so as in [SKN94] we
do not include it in our comparison.
The experiments were carried out with an Origin2000 ma-
chine on one 250MHz MIPS R10000 processor. This system
has 32KB of L1 cache, consisting of 1024 lines of 32bytes,
4MB of L2 cache, consisting of 32768 lines of 128 bytes, and
sufficient main memory to hold all data structures. Further,
this system uses a page size of 16KB and has 64 TLB entries.
We used the hardware event counters of the MIPS R10000
CPU [Sil97] to get exact data on the number of cycles, TLB
misses, L1 misses and L2 misses during these experiments.4
Using the data from the experiments, we formulate an ana-
lytical main-memory cost model that quantifies query cost in
terms of these hardware events.
4.2 Radix-cluster
To analyze the impact of all three parameters (B, P , Bp) on
radix-clustering, we conduct two series of experiments, keep-
ing one parameter fixed and varying the remaining two.
4 The Intel Pentium family, SUN UltraSparc, and DECAlpha pro-
vide similar counters.
238 S. Manegold et al.: Optimizing database architecture for the new bottleneck: memory access
First, we conduct experiments with various numbers of
bits and passes, distributing the bits evenly across the passes.
The points in Fig. 11 depict the results for a BAT of 8M tuples
– the remaining cardinalities (15625≤ C ≤ 64M) behave the
same way. The vertical grid lines indicate where the number
of clusters created is equal to the number of TLB entries and
L1 and L2 cache lines – or a power of those – respectively. Up
to 6 bits, using just one pass yields the best performance (in
ms). Then, as the number of clusters to be filled concurrently
exceeds the number of TLB entries (64), the number of TLB
misses increases tremendously, decreasing the performance.
With more than 6 bits, two passes perform better than one.
The costs of an additional pass are more than compensated
by having significantly less TLB misses in each pass using
half the number of bits. Analogously, three passes should be
used with more than 12 bits, and four passes with more than
18 bits. Thus, the number of clusters per pass is limited to
at most the number of TLB entries. A second more moderate
increase in TLB misses occurs when the number of clusters
exceeds the number of L2 cache lines. Then, the additional L2
misses are caused by cache conflicts, forcing modified cache
lines to be written back to memory before they are completely
filled. These write-backs refer to pages whose addresses are
no longer cached in the TLB, yielding an additional TLBmiss
per L2 miss.
Similarly, the number of L1 cache misses and L2 cache
misses significantly increaseswhenever the number of clusters
per pass exceeds the number of L1 cache lines (1024) and L2
cache lines (32768), respectively. The impact of the additional
L2 misses on the total performance is obvious for one pass (it
does not occur with more than one pass, as then at most 13
bits are used per pass). The impact of the additional L1 misses
on the total performance nearly completely vanishes due to
the heavier penalty of TLB misses and L2 misses. Finally,
we notice that the best-case execution time increases with the
number of bits used.
The following model calculates the total execution costs
for a radix-cluster depending on the number of passes, the
number of bits, and the cardinality of the input relation (C =
|Re|):
Tc(P,B,C) = P ·
(
C · wc +ML1,c
(
B
P
,C
)
· lL2
+ML2,c
(
B
P
,C
)
· lMem
+MTLB,c
(
B
P
,C
)
· lTLB
)
with (i ∈ {1, 2})
MLi,c(Bp, C) =
2 · |Re|Li +


C · Hp|Li|Li , if Hp ≤ |Li|Li
C ·
[
1 + log
(
Hp
|Li|Li
)]
, if Hp > |Li|Li
and
1e+07
L1 L12
L1
 m
is
se
s
4 passes
3 passes
2 passes
1 pass    
1e+06
1e+07
L2
L2
 m
is
se
s
1e+04
1e+05
1e+06
1e+07
TLB TLB2 TLB3 TLB
TL
B 
m
is
se
s
1e+03
1e+04
5 10 15 20
m
illi
se
cs
number of bits
Fig. 11. Performance (points) and model (lines) of radix-cluster
(C = 8M)
S. Manegold et al.: Optimizing database architecture for the new bottleneck: memory access 239
MTLB,c(Bp, C) =
2 · |Re|Pg +


|Re|Pg ·
(
Hp
|TLB|
)
, if Hp ≤ |TLB|
C ·
(
1− |TLB|
Hp
)
, if Hp > |TLB|
+


C ·
(
Hp
|L2|L2
)
, if Hp ≤ |L2|L2
C ·
[
1 + log
(
Hp
|L2|L2
)]
, if Hp > |L2|L2
|Re|Li and |Cl|Li denote the number of cache lines per rela-
tion and cluster, respectively, |Re|Pg the number of pages per
relation, |Li|Li the total number of cache lines, both for the
L1 (i = 1) and L2 (i = 2) caches, and |TLB| the number of
TLB entries.
wc denotes the pure CPU costs per tuple. To calibrate
wc, we reduced the cardinality so that all data fits in L1, and
pre-loaded the input relation. Thus, we avoided memory ac-
cess completely. We measured wc = 50ns on the Origin2000
(250MHz).
The first term ofMLi,c equals the minimal number of Li
misses per pass for fetching the input and storing the output.
The second term counts the number of additional Li misses,
when the number of clusters either approaches the number
of available Li lines (Hp ≤ |Li|Li ) or even exceeds this
(Hp > |Li|Li ). First, the probability that the requested cluster
is not in the cache – due to address conflicts – increases until
Hp = |Li|Li . Then, the cache capacity is exhausted, and a
cache miss for each tuple to be assigned to a cluster is certain.
But, with further increasing Hp, the number of cache misses
also increases, as now also the cache lines of the input may
be replaced before all tuples are processed. Thus, each input
cache line has to be loaded more than once.
The first two terms of MTLB,c are made up analogously.
Additionally, using the same schema asML2,c, the third term
models the additional TLB misses that occur due to write-
backs (see above) when the number of clusters either ap-
proaches the number of available L2 lines (Hp ≤ |L2|L2)
or even exceeds this (Hp > |L2|L2).
The lines in Fig. 11 represent our model for a BAT of 8M
tuples. The model shows to be very accurate5. Figures 12 and
13 confirm the accuracy of our model for various cardinali-
ties. In Fig. 12, the optimal number of passes is chosen per
event, showing that our model correctly predicts the behav-
ior of each single event. Figure 13 uses the number of passes
that achieves minimal execution time. Here, the graphs show
that (on the Origin2000) the impact of TLB misses dominates
the execution time. At most 64 clusters should be generated
per pass, although the caches would allow 1024 (L1 cache) or
even 32768 (L2 cache).
The question remaining is how to distribute the number
of radix bits over the passes. We conducted another number
of experiments, using a fixed number of passes, but varying
the number of radix bits per pass. The results showed that an
even distribution of radix bits (i.e., Bi ≈ BP , i ∈ {1, . . . , P})
achieves the best performance.
5 On our Origin2000, we calibrated lTLB = 228ns, lL2 = 24ns,
and lMem = 400ns [MBK99].
4.3 Isolated join performance
We now analyze the impact of the number of radix bits on the
pure join performance, not including the clustering cost.
4.3.1 Radix-Join
The points in Fig. 14 depict the experimental results of radix-
join (L1 and L2 cache misses, TLB misses, elapsed time) for
different cardinalities. The lower graph (in ms) shows that the
performance of radix-join improves with increasing number
of radix bits. The upper graph confirms that only cluster sizes
significantly smaller than L1 size are reasonable. Otherwise,
the number of L1 cachemisses explodes due to cache trashing.
We limited the execution time of each single run to 15 min,
thus using only cluster sizes significantly smaller than L2 size
andTLB size (i.e., number ofTLB entries× page size). That is
why the number of L2 cache misses stay almost constant. The
performance improvement continues until the mean cluster
size is 1 tuple. At that point, radix-join has degenerated to
sort/merge-join.The high cost of radix-joinwith a large cluster
size is explained by the fact that it performs nested-loop join
on each pair of matching clusters. Therefore, clusters need to
be kept small; our results indicate that a cluster size of 8 tuples
is optimal.
The following model calculates the total execution costs
for a radix-join, depending on the number of bits and the car-
dinality6:
Tr(B,C) = C ·
⌈
C
H
⌉
· wr + C · w′r +ML1,r(B,C) · lL2
+ML2,r(B,C) · lMem
+MTLB,r(B,C) · lTLB,
with (i ∈ {1, 2})
MLi,r(B,C) =
3 · |Re|Li +


C · ||Cl||||Li|| , if ||Cl|| ≤ ||Li||
C · |Cl|Li , if ||Cl|| > ||Li||
and
MTLB,r(B,C) =
3 · |Re|Pg +


C · ||Cl||||TLB|| , if ||Cl|| ≤ ||TLB||
C · |Cl|Pg, if ||Cl|| > ||TLB||.
|Re|Pg, |Re|Li , and |Li|Li are as above (i ∈ {1, 2}); ||Cl|| and
|Cl|Pg denote the cluster size in byte and number of pages,
respectively; ||Li|| and ||TLB|| denote (in byte) the size of
both caches (i ∈ {1, 2}) and the memory range covered by
|TLB| pages, respectively.
6 For simplicity of presentation, we assume the cardinalities of
both operands and the result to be the same.
240 S. Manegold et al.: Optimizing database architecture for the new bottleneck: memory access
1e+04
1e+05
1e+06
1e+07
1e+08
L1 L12
L1
 m
iss
es
64000000
8000000
1000000
12500
15625
4 passes
3 passes
2 passes
1 pass    
1e+03
1e+04
1e+05
1e+06
1e+07
1e+08
L2
L2
 m
iss
es
1e+02
1e+03
1e+04
1e+05
1e+06
1e+07
1e+08
TLB TLB2 TLB3 TLB4
TL
B 
m
is
se
s
1e+00
1e+01
1e+02
1e+03
1e+04
1e+05
5 10 15 20
m
illi
se
cs
number of bits
Fig. 12. Performance (points) and model (lines) of radix-cluster (op-
timal number of passes per event)
1e+04
1e+05
1e+06
1e+07
1e+08
L1 L12
L1
 m
iss
es
64000000
8000000
1000000
12500
15625
4 passes
3 passes
2 passes
1 pass    
1e+03
1e+04
1e+05
1e+06
1e+07
1e+08
L2
L2
 m
iss
es
1e+02
1e+03
1e+04
1e+05
1e+06
1e+07
1e+08
TLB TLB2 TLB3 TLB4
TL
B 
m
is
se
s
1e+00
1e+01
1e+02
1e+03
1e+04
1e+05
5 10 15 20
m
illi
se
cs
number of bits
Fig. 13. Performance (points) and model (lines) of radix-cluster
(number of passes for best performance)
S. Manegold et al.: Optimizing database architecture for the new bottleneck: memory access 241
1e+04
1e+05
1e+06
1e+07
1e+08
1e+09
1e+10
L1
 m
is
se
s
clustersize<L1size
64000000
8000000
1000000
125000
15625
1e+04
1e+05
1e+06
1e+07
L2
 m
iss
es
clustersize<L2size
1e+02
1e+03
1e+04
1e+05
TL
B 
m
is
se
s
1e+01
1e+02
1e+03
1e+04
1e+05
1e+06
5 10 15 20 25
m
illi
se
cs
number of bits
clustersize<L1size
clustersize<8tuples
Fig. 14. Performance and model of radix-join
The first term of Tr calculates the costs for evaluating
the join predicate – each tuple of the outer relation has to
be checked against each tuple in the respective cluster; the
cost per check is wr. The second term represents the costs for
creating the result, with w′r denoting the costs per tuple.
The left term of MLi,r equals the minimal number of Li
misses for fetching both operands and storing the result. The
right term counts the number of additional Li misses during
the inner loop, when the cluster size either approaches Li size
(||Cl|| ≤ ||Li||) or even exceeds this (||Cl|| > ||Li||). First,
the probability that the requested tuple is not in the cache
– due to capacity conflicts – increases with growing cluster
size. Then, the cache capacity is exhausted, and a cache miss
for each tuple to be joined is certain. With further increasing
cluster size, the number of cache misses also increases, as
now each iteration of the inner loop also causes a cache miss.
MTLB,r is made up analogously. The lines in Fig. 14 prove the
accuracy of our model for different cardinalities (wr = 24ns,
w′r = 240ns).
4.3.2 Partitioned hash-join
The partitioned hash-join also exhibits increased performance
with increasing number of radix bits. Figure 15 shows that
performance increase flattens after the point where the entire
inner cluster (including its hash table) consists of less pages
than there are TLB entries (64). Then, it also fits the L2 cache
comfortably. Thereafter performance decreases only slightly
until the point that the inner cluster fits the L1 cache. Here,
performance reaches its minimum. The fixed overhead by al-
location of the hash-table structure causes performance to de-
crease when the cluster sizes get too small ( 200 tuples) and
clusters get very numerous.
As for the radix-join, we also provide a cost model for the
partitioned hash-join:
Th(B,C) = C · wh +H · w′h +ML1,h(B,C) · lL2
+ML2,h(B,C) · lMem
+MTLB,h(B,C) · lTLB,
with (i ∈ {1, 2})
MLi,h(B,C) =
3·|Re|Li +


C · ||Cl||||Li|| , if ||Cl|| ≤ ||Li||
C · 10 ·
(
1− ||Li||||Cl||
)
, if ||Cl|| < ||Li||
and
MTLB,h(B,C) =
3·|Re|Pg +


C · ||Cl||||TLB|| , if ||Cl|| ≤ ||TLB||
C · 10 ·
(
1− ||Li||||TLB||
)
, if ||Cl|| > ||TLB||.
|Re|Li, |Re|Pg, ||Cl||, ||Li||, and ||TLB|| are as above.
wh represents the pure calculation costs per tuple, i.e.,
building the hash-table, doing the hash-lookup and creating
242 S. Manegold et al.: Optimizing database architecture for the new bottleneck: memory access
1e+04
1e+05
1e+06
1e+07
1e+08
1e+09
L1
 m
is
se
s
clustersize<L1size
64000000
8000000
1000000
125000
15625
1e+04
1e+05
1e+06
1e+07
1e+08
L2
 m
iss
es
clustersize<L2size
1e+02
1e+03
1e+04
1e+05
1e+06
1e+07
1e+08
1e+09
TL
B 
m
is
se
s
clustersize>TLBsize
1e+01
1e+02
1e+03
1e+04
1e+05
1e+06
5 10 15 20 25
m
illi
se
cs
number of bits
clustersize>L2size clustersize<TLBsize
clustersize<L1size
clustersize<256tuples
Fig. 15. Performance and model of partitioned hash-join
the result. w′h represents the additional costs per cluster for
creating and destroying the hash-table.
The left term of MLi,h equals the minimal number of Li
misses for fetching both operands and storing the result. The
right term counts the number of additional Limisses, when the
cluster size either approaches Li size or even exceeds this. As
soon as the clusters become significantly larger than Li, each
memory access yields a cachemiss due to cache trashing: with
a bucket-chain length of 4, up to 8 memory accesses per tuple
are necessarywhile building the hash-table and performing the
hash-lookup, and another 2 to access the actual tuple.When the
cluster sizes get very small, hash-tables of a fixedminimal size
(256 buckets) need to be allocated and destroyed at increasing
frequency. This causes additional L1 misses, approximately 6
per cluster. Hence, the term H · 6 has to be added to ML1,h.
Again, the number of TLB misses is modeled analogously.
The lines in Fig. 15 represent our model for different car-
dinalities (wh = 680ns, w′h = 3600ns). The predictions are
very accurate.
4.3.3 Improved partitioned hash-join
The model in the previous section shows that our original
implementation of partitioned hash-join suffers from two pa-
rameter settings: the average hash-bucket size of 4 tuples and
the minimal hash-table size of 256 buckets.
Hash-bucket size. Following the linked list within a hash-
bucket (during hash-build and hash-probe) performs a random
memory access pattern with very poor data locality. Hence, up
to 8 additional cache andTLBmisses per tuple occurwith large
clusters (see above). The only way to improve this situation
is to avoid random memory access as much as possible by
reducing the hash-bucket size.Wemodify our implementation
to use perfect hashing, i.e., reducing the targeted hash-bucket
size from 4 tuples to just 1 tuple.
Hash-table size. Another (minor) improvement we make is
that we decrease theminimal hash-table size from 256 buckets
to just 2 buckets.
Figure 16 compares our original version of the partitioned
hash-join as presented in [BMK99] with the improved im-
plementation as presented above. Reducing the hash-bucket
size yields a significant reduction in TLB and cache misses
for large clusters: from 8 to just 3 additional misses per tuple;
in other words, almost a factor of 2 in the total number of
misses. This in turn speeds up the execution time by almost
a factor of 2 for large clusters. Even for small clusters, where
no additional cache or TLB misses occur, the performance
has (slightly) improved, as now less comparisons are neces-
sary during hash-lookups.Additionally, reducing the minimal
hash-table size avoids the increase in L1 cache misses with
very small clusters. But this has hardly any impact on the ex-
ecution time, as the CPU costs for creating and destroying a
large number of tiny hash-tables dominate the performance.
Altogether, our results show that tuning the join phase of a
partitioned hash-join appropriately – in addition to optimizing
the clustering phase as we proposed in [BMK99] – achieves
an additional performance improvement of up to a factor of 2.
S. Manegold et al.: Optimizing database architecture for the new bottleneck: memory access 243
0e+00
1e+07
2e+07
3e+07
4e+07
5e+07
6e+07
7e+07
8e+07
9e+07
L1
 m
iss
es
clustersize<L1size
phash
phash+
 
0e+00
1e+07
2e+07
3e+07
4e+07
5e+07
6e+07
L2
 m
iss
es
clustersize<L2size
0e+00
1e+07
2e+07
3e+07
4e+07
5e+07
6e+07
7e+07
8e+07
TL
B 
m
iss
es
clustersize<TLBsize
0e+00
1e+04
2e+04
3e+04
4e+04
5e+04
5 10 15 20
m
illi
se
cs
number of bits
clustersize<256
Fig. 16. Original vs. improved partitioned hash-join (C = 8M)
4.4 Overall join performance
After having analyzed the impact of the tuning parameters on
the clustering phase and the joining phase separately, we now
turn our attention to the combined cluster and join cost for
both the partitioned hash-join and radix-join. Radix-cluster
gets cheaper for less radix B bits, whereas both radix-join
and partitioned hash-join get more expensive. Putting together
the experimental data we obtained on both cluster and join
performance, we determine the optimum number ofB for the
relation cardinality and the join algorithm.
It turns out that there are four possible strategies, which
correspond to the diagonals in Figs. 15 and 14:
phash(+) L2: (improved) partitioned hash-join on B =
log2(C · 12/||L2||) clustered bits, so the inner relation
plus hash-table fits the L2 cache. This strategy was used
in the work of Shatdahl et al. [SKN94] in their partitioned
hash-join experiments.
phash(+) TLB: (improved) partitioned hash-join on B =
log2(C · 12/||TLB||) clustered bits, so the inner relation
plus hash-table spans at most |TLB| pages. Our exper-
iments show a significant improvement of the pure join
performance between phash L2 and phash TLB.
phash(+) L1: (improved) partitioned hash-join on B =
log2(C · 12/||L1||) clustered bits, so the inner relation
plus hash-table fits the L1 cache. This algorithm usesmore
clustered bits than the previous ones; hence it really needs
the multi-pass radix-cluster algorithm (a straightforward
1-pass cluster would cause cache trashing on this many
clusters).
radix: radix-join on B = log2(C/8) clustered bits. The
radix-join has the most stable performance but has higher
startup cost, as it needs to radix-cluster on significantly
more bits than the other options. It is therefore only a win-
ner with large cardinalities.
Figure 17 compares radix-join (thin lines) and original
partitioned hash-join (thick lines) throughout the whole bit
range, using the corresponding optimal number of passes for
the radix-cluster (see Sect. 4.2). The diagonal lines mark the
setting for B that belong to the four strategies. The optimal
setting for original partitioned hash-join varies between phash
TLB and phash L1. With bigger clusters, the join phase is too
expensive; with smaller clusters, clustering becomes too ex-
pensive. The differences between phash TLB and phash L1
are very small; hence, for simplicity of presentation, we refer
to the optimal setting as “phash TLB.” Similarly, radix-join
yields it best performance somewhere between 16 and 4 tu-
ples per cluster.We refer to the optimal setting as “radix 8.” In
most cases, radix 8 outperforms phash TLB slightly. Figure 18
shows the respective comparison between radix-join and im-
proved partitioned hash-join. Again, phash+ TLB is the opti-
mal setting, but now phash+ TLB slightly outperforms radix
8.
Figure 19 compares the overall performance of all three
join algorithms – radix, phash, and phash+ – for a cardinality
of 8M tuples. The non-logarithmic scale on the y-axis clearly
shows the large improvement in the performance of database
algorithms (an equi-join, in this case) due to memory access
optimization.
Finally, Table 2 compares our radix-cluster-based strate-
gies to non-partitioned (“simple”) hash-join and sort/merge-
244 S. Manegold et al.: Optimizing database architecture for the new bottleneck: memory access
1e+01
1e+02
1e+03
1e+04
1e+05
1e+06
0 5 10 15 20 25
   L2
phash
  TLB
phash
   L1
phash
    8
radix
m
illi
se
cs
number of bits
64000000
16000000
4000000
1000000
250000
62500
15625
 
1 pass    
2 passes
3 passes
4 passes
 
simple hash-join
phash min, radix min
best
 
Fig. 17. Overall performance of radix-join (thin lines) vs. original
partitioned hash-join (thick lines)
join. Our sort/merge-join uses a quick-sort algorithm, which
shows a reasonably good memory access pattern. For larger
cardinalities, it runs up to twice as fast as simple hash-join.
However, on smaller relations the that fit into the L2 cache,
simple hash-join is about twice as fast as sort/merge-join.
The original version of partitioned hash-join (phash TLB) and
radix 8 show very similar performance, running up to almost
6 times faster than simple hash-join. The improved version of
partitionedhash-join (phash+TLB)performs about 10%–20%
faster than radix 8 and phash TLB, being almost a factor of 7
faster than the simple hash-join. This clearly demonstrates that
cache-conscious join algorithms perform significantly better
than the “random-access” algorithms. Here, “cache
conscious” does not only refer to L2 cache, but also to the L1
cache and especially the TLB. Further, Figs. 17 and 18 show
that our radix algorithms improve join performance, both in
the “phash(+) TLB / L1” strategies (cardinalities larger than
250000 require at least two clustering passes) and with the
radix-join itself.
5 Evaluation
In this research, we brought to light the severe impact of mem-
ory access on the performance of elementary database opera-
tions. Hardware trends indicate that this bottleneck has been
present for quite some time; hence our expectation is that its
impact will eventually become deeper than the I/O bottleneck.
1e+01
1e+02
1e+03
1e+04
1e+05
1e+06
0 5 10 15 20 25
   L2
phash+
  TLB
phash+
   L1
phash+
    8
radix
m
illi
se
cs
number of bits
64000000
16000000
4000000
1000000
250000
62500
15625
 
1 pass    
2 passes
3 passes
4 passes
 
simple hash-join+
phash+ min, radix min
best
 
Fig. 18. Overall performance of radix-join (thin lines) vs. improved
partitioned hash-join (thick lines)
Database algorithms and data structures should therefore be
designed and optimized for memory access from the outset.
Sloppy implementation of the key algorithms or “features”
at the innermost level of an operator tree (e.g., pointer swiz-
zling/object table lookup) can lead to a performance disaster
that ever faster CPUs will not be able to rescue.
Conversely, careful design can lead to an order of magni-
tude performance advancement. In our Monet system, under
development since 1992, we have decreased the memory ac-
cess stride using vertical decomposition; a choice that back
in 1992 was mostly based on intuition. The work presented
here now provides strong evidence that this feature is in fact
the basis of good performance. Our simple-scan experiment
demonstrates that decreasing the stride is crucial for optimiz-
ing usage of memory bandwidth.
Concerning query processing algorithms, we have formu-
lated radix algorithms and demonstrated through experimen-
tation that these algorithms form both an addition and an im-
provement to the work in [SKN94]. The modeling work done
to show how these algorithms improve cache behavior dur-
ing join processing represents an important improvement over
previous work on main-memory cost models [LN96,WK90].
Rather than characterizing main-memory performance on the
coarse level of a procedure call with “magical” costs factors
obtained by profiling, our methodology mimics the memory
access pattern of the algorithm to be modeled and then quanti-
fies its cost by counting cachemiss events and CPU cycles.We
were helped in formulating these models through our usage of
S. Manegold et al.: Optimizing database architecture for the new bottleneck: memory access 245
Table 2.Algorithm comparison: absolute performance and relative improvement over simple hash-join and sort/merge-join
Cardinality Simple hash-join Sort/merge-join phash TLB radix 8 phash+ TLB
A (ms) B (ms) A/B C (ms) A/C B/C D (ms) A/D B/D E (ms) A/E B/E
64000000 618527 400074 1.55 104467 5.92 3.83 106504 5.81 3.76 91922 6.73 4.35
32000000 273312 179150 1.53 49579 5.51 3.61 49104 5.57 3.65 43471 6.29 4.12
16000000 117519 55515 2.12 24892 4.72 2.23 25759 4.56 2.16 20972 5.60 2.65
8000000 52652 26639 1.98 11992 4.39 2.22 11547 4.56 2.31 10134 5.20 2.63
4000000 24170 12193 1.98 5552 4.35 2.20 5540 4.36 2.20 5033 4.80 2.42
2000000 11033 5855 1.88 2730 4.04 2.14 2743 4.02 2.13 2064 5.35 2.84
1000000 4849 2743 1.77 1264 3.84 2.17 1277 3.80 2.15 1038 4.67 2.64
500000 1877 1326 1.42 617 3.04 2.15 608 3.09 2.18 498 3.77 2.66
250000 597 613 0.97 301 1.98 2.04 249 2.40 2.46 243 2.46 2.52
125000 149 286 0.52 138 1.08 2.07 100 1.49 2.86 118 1.26 2.42
62500 66 137 0.48 64 1.03 2.14 47 1.40 2.91 48 1.38 2.85
31250 31 65 0.48 31 1.00 2.10 23 1.35 2.83 23 1.35 2.83
15625 15 31 0.48 15 1.00 2.07 11 1.36 2.82 11 1.36 2.82
0
10
20
30
40
50
0 5 10 15 20 25
||L2||    ||TLB|| ||L1|| 256*12 8*8
se
co
n
ds
number of bits
clustersize [bytes]
 
radix
phash
phash+
 
1 pass    
2 passes
3 passes
4 passes
 
simple hash-join
 
local min
 
global min
 
Fig. 19. Overall performance of radix-join, original and improved
partitioned hash-join (C=8M)
hardware event counters present inmodernCPUs.Our detailed
cost models enabled us to identify a significant bottleneck in
the implementation of the partitioned hash-join (following the
bucket-chain during hash-lookups caused too many cache and
TLB misses) and hence to improve the implementation using
perfect hashing.
We think our findings are not only relevant tomain-memo-
ry databases engineers. Vertical fragmentation and memory
access cost have a strong impact on performance of database
systems at a macro level, including those that manage disk-
resident data. Nyberg et al. [NBC+94] stated that techniques
such as software-assisted disk-striping have reduced the I/O
bottleneck, i.e., queries that analyze large relations (like in
OLAP orDataMining) now read their data faster than it can be
processed. We observed this same effect with the Drill Down
Benchmark [BRK98], where a commercial database product
managing disk-resident data was run with a large buffer pool.
While executing almost exclusivelymemory-bound, this prod-
uct was measured to be a factor 40 slower on this benchmark
than the Monet system. After inclusion of cache-optimization
techniques such as described in this paper, we have since been
able to improve our own results on this benchmarkwith almost
an extra order ofmagnitude.This clearly shows the importance
of main-memory access optimization techniques.
In Monet, we use I/O by manipulating virtual memory
mappings and hence treat disk-resident data as memory with
a large granularity. This is in line with the consideration that
disk-resident data is the bottom level of a memory hierarchy
that goes up from the virtual memory, to the main memory
through the cache memories up to the CPU registers (Fig. 2).
Algorithms that are tuned to run well on one level of the mem-
ory also exhibit high performance on the lower levels (e.g.,
radix-join has pure sequential access and consequently also
runs well on virtual memory). As the major performance bot-
tleneck is shifting from I/O to memory access, we therefore
think that main-memory optimization of both data structures
and algorithms – like those described in this paper – will be
increasingly decisive in efficiently exploiting the power of cus-
tom hardware.
6 Conclusion and future work
It was shown that memory access cost is increasingly a bot-
tleneck for database performance.We subsequently discussed
the consequences of this finding on both the data structures
and algorithms employed in database systems.We recommend
using vertical fragmentation in order to better use scarcemem-
ory bandwidth. We introduced new radix algorithms for use
in join processing, and we formulated detailed analytical cost
models that explain why these algorithms make optimal use
of hierarchical memory systems found in modern computer
246 S. Manegold et al.: Optimizing database architecture for the new bottleneck: memory access
hardware. Further, our cost models enabled us to identify the
hash-bucket size as a performance bottleneck for hash-joins in
themainmemory.We showed that using perfect hashing solves
the problem and achieves the best performance, up to twice
as fast as our previous results [BMK99]. Finally, we placed
our results in a broader context of database architecture, and
made recommendations for future systems.
Weonly used one sample architecture (anSGIOrigin2000)
in this study; however, our on-goingwork is to investigate how
both our optimization techniques and our cost models perform
on other hardware platforms. In [MBK00], we present a cali-
bration tool to automatically extract relevant characteristics of
the (cache-) memory system (such as cache sizes, cache line
sizes, and cachemiss latencies) of any computer hardware.We
show that feeding these parameters into the cost models pre-
sented here is sufficient to accurately predict the performance
of database algorithms on different popular computer systems,
such as a SunUltraworkstation, an Intel PentiumIII PC, and an
AMDAthlon PC. This in turn enables us to automatically tune
our memory-conscious algorithms to their optimal settings on
any hardware they run on.
References
[AvdBF+92] Apers P.M.G., van den Berg C.A., Flokstra J., Grefen
P.W.P.J., Kersten M., Wilschut A.N. PRISMA/DB: A
parallel main memory relational DBMS. IEEE Trans-
actions on Knowledge and Data Engineering 4(6):
541–554, December 1992
[BK95] Boncz P., Kersten M. Monet: An impressionist sketch
of an advanced database system. In: Proc. Basque Int.
Workshop on Information Technology, San Sebastian,
Spain, July 1995
[BK99] Boncz P., Kersten M. MIL primitives for querying a
fragmented world. The VLDB Journal 8(2): 101–119,
Oct 1999
[BMK99] Boncz P., Manegold S., Kersten M. Database architec-
ture optimized for the new bottleneck: memory access.
In: Proc. Int. Conf. on Very Large Data Bases, pp. 54–
65, Edinburgh, Scotland, UK, September 1999
[BQK96] Boncz P., Quak W., Kersten M. Monet and its
geographical extensions: a novel approach to high-
performance GIS processing. In: Proc. Int. Conf. on
Extending Database Technology, pp. 147–166, Avi-
gnon, France, June 1996
[BRK98] Boncz P., Ru¨hl T., Kwakkel F. The drill down bench-
mark. In: Proc. Int. Conf. on Very Large Data Bases,
pp. 628–632, NewYork, USA, June 1998
[BWK98] Boncz P., Wilschut A.N., Kersten M. Flattening an
object algebra to provide performance. In: Proc IEEE
Int. Conf. on Data Engineering, pp. 568–577, Orlando,
Fla., USA, February 1998
[CK85] CopelandG.P.,KhoshafianS.Adecomposition storage
model. In: Proc. ACM SIGMOD Int. Conf. on Man-
agement ofData, pp. 268–279,Austin,Tex.,USA,May
1985
[Knu68] Knuth D.E. The art of computer programming, vol. 1.
Addison-Wesley, Reading, Mass., USA, 1968
[LC86] Lehman T.J., Carey M.J. A study of index structures
for main memory database management systems. In:
Proc. Int. Conf. on Very Large Data Bases, pp. 294–
303, Kyoto, Japan, August 1986
[LN96] Listgarten S., Neimat M.-A. Modelling costs for a
MM-DBMS. In: Proc. In.t Workshop on Real-Time
Databases, Issues and Applications, pp. 72–78, New-
port Beach, Calif., USA, March 1996
[MBK99] Manegold S., Boncz P., Kersten M. Optimizing main-
memory join on modern hardware. Technical Report
INS-R9912, CWI, Amsterdam, The Netherlands, Oc-
tober 1999
[MBK00] Manegold S., Boncz P.A., KerstenM.L. What happens
during a join? –DissectingCPUandmemoryoptimiza-
tion effects. In: Proc. Int. Conf. on Very Large Data
Bases, Cairo, Egypt, September 2000, pp. 339–350
[McC95] J.D. McCalpin. Memory bandwidth and machine
balance in current high performance computers.
IEEE Technical Committee on ComputerArchitecture
Newsletter, December 1995
[MKW+98] McKee S., Klenke R., Wright K., Wulf W., Salinas
M., Aylor J., Batson A. Smarter memory: improving
bandwidth for streamed references. IEEE Computer
31(7): 54–63, July 1998
[Mow94] Mowry T.C. Tolerating latency through software-
controlled data prefetching. PhD thesis, Stanford Uni-
versity, Computer Science Department, 1994
[NBC+94] Nyberg C., Barclay T., Cvetanovic Z., Gray J., Lomet
D. AlphaSort: a RISC machine sort. In Proc. ACM
SIGMOD Int. Conf. onManagement of Data, pp. 233–
242, Minneapolis, Minn., USA, May 1994
[Ram96] Rambus Technologies, Inc. Direct rambus technology
disclosure, 1996. http://www.rambus.com/docs/drte-
chov.pdf
[Ron98] Ronstro¨m M. Design and Modeling of a parallel data
server for telecom applications. PhD thesis, Linko¨ping
University, Sweden, 1998
[Sil97] Silicon Graphics, Inc., Mountain View, Calif. Perfor-
mance Tuning and Optimization for Origin2000 and
Onyx2, January 1997
[SKN94] Shatdal A., Kant C., Naughton J. Cache conscious
algorithms for relational query processing. In: Proc.
Int. Conf. on Very Large Data Bases, pp. 510–512,
Santiago, Chile, September 1994
[SLD97] SLDRAM Inc. SyncLink DRAM Whitepaper,
1997. http://www.sldram.com/Documents/SL-
DRAMwhite970910.pdf
[Val87] Valduriez P. Join indices. ACM Transactions on
Database Systems 12(2): 218–246, June 1987
[WK90] Whang K.-Y., Krishnamurthy R. Query Optimization
in a Memory-Resident Domain Relational Calculus
DatabaseSystem.ACMTransactions onDatabaseSys-
tems 15(1): 67–95, March 1990
