B-tree indexes for high update rates
In some applications, data capture dominates query processing. For example, monitoring moving objects often requires more insertions and updates than queries. Data gathering using automated sensors often exhibits this imbalance. More generally, indexing streams is apparently still considered an unsolved problem.
For those applications, B-tree indexes are reasonable choices if some trade-off decisions are tilted towards optimization of updates rather than of queries. This paper surveys techniques that let B-trees sustain very high update rates, up to multiple orders of magnitude higher than traditional B-trees, at the expense of query processing performance. Perhaps not surprisingly, some of these techniques are reminiscent of those employed during index creation, index rebuild, etc., while others are derived from other well-known technologies such as differential files and log-structured file systems.
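As a concrete illustration of the buffering-and-merging family of techniques the survey covers (differential files, log-structured organization), here is a minimal Python sketch. The class and method names are ours, not the paper's, and the in-memory lists merely stand in for on-disk structures:

```python
import bisect

class BufferedIndex:
    """Sketch: absorb insertions in a small buffer (a differential file
    in miniature) and merge into the main sorted structure in bulk."""

    def __init__(self, merge_threshold=1024):
        self.main = []                 # stand-in for the main B-tree, kept sorted
        self.buffer = []               # recent insertions, unsorted
        self.merge_threshold = merge_threshold

    def insert(self, key):
        # Fast path: an O(1) append, no tree descent per insertion.
        self.buffer.append(key)
        if len(self.buffer) >= self.merge_threshold:
            self._merge()

    def _merge(self):
        # Amortized bulk merge, analogous to applying a differential file.
        self.buffer.sort()
        merged, i, j = [], 0, 0
        while i < len(self.main) and j < len(self.buffer):
            if self.main[i] <= self.buffer[j]:
                merged.append(self.main[i]); i += 1
            else:
                merged.append(self.buffer[j]); j += 1
        merged.extend(self.main[i:])
        merged.extend(self.buffer[j:])
        self.main, self.buffer = merged, []

    def lookup(self, key):
        # Queries pay the trade-off: both structures must be probed.
        if key in self.buffer:
            return True
        i = bisect.bisect_left(self.main, key)
        return i < len(self.main) and self.main[i] == key
```

Lookups probing both the buffer and the main structure is exactly the query-performance cost the abstract describes as the price of the higher update rate.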
Robust and Efficient Sorting with Offset-Value Coding
Sorting and searching are large parts of database query processing, e.g., in the forms of index creation, index maintenance, and index lookup; and comparing pairs of keys is a substantial part of the effort in sorting and searching. We have worked on simple, efficient implementations of decades-old, neglected, effective techniques for fast comparisons and fast sorting, in particular offset-value coding. In the process, we happened upon its mutually beneficial relationship with prefix truncation in run files as well as the duality of compression techniques in row- and column-format storage structures, namely prefix truncation and run-length encoding of leading key columns. We also found a beneficial relationship with consumers of sorted streams, e.g., merging parallel streams, in-stream aggregation, and merge join. We report on our implementation in the context of Google's Napa and F1 Query systems as well as an experimental evaluation of performance and scalability.
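To make the core idea concrete, here is a small Python sketch of offset-value coding for fixed-arity integer keys; it reflects our reading of the published technique rather than the paper's implementation, and ARITY, DOMAIN, and the function names are illustrative:

```python
ARITY = 4      # number of key columns (illustrative)
DOMAIN = 100   # per-column domain: integer values in [0, DOMAIN)

def ovc(key, base):
    """Offset-value code of `key` relative to `base`, for key >= base.
    Among keys coded against the same base, a larger code implies a
    larger key, so codes can substitute for full key comparisons."""
    for i in range(ARITY):
        if key[i] != base[i]:
            return (ARITY - i) * DOMAIN + key[i]
    return 0  # key equals base

def compare(a, code_a, b, code_b):
    """Order two keys whose codes are relative to the same base key, as a
    merge of sorted runs would. Returns (winner, loser, loser_code), where
    loser_code is valid relative to the winner, the merge's next base."""
    if code_a != code_b:
        # The codes alone decide; no key columns are inspected, and the
        # loser's code remains valid relative to the winner.
        return (a, b, code_b) if code_a < code_b else (b, a, code_a)
    if code_a == 0:  # both keys equal the base, hence each other
        return (a, b, 0)
    # Equal nonzero codes: the keys agree through the coded column, so
    # resolve the tie on later columns and recompute the loser's code.
    offset = ARITY - code_a // DOMAIN
    for j in range(offset + 1, ARITY):
        if a[j] != b[j]:
            if a[j] < b[j]:
                return (a, b, (ARITY - j) * DOMAIN + b[j])
            return (b, a, (ARITY - j) * DOMAIN + a[j])
    return (a, b, 0)  # keys are fully equal

# Example: two merge candidates coded against the last output key.
base = (1, 2, 3, 4)
a, b = (1, 2, 5, 0), (1, 4, 0, 0)
print(compare(a, ovc(a, base), b, ovc(b, base)))
# ((1, 2, 5, 0), (1, 4, 0, 0), 304) -- decided without touching a's columns
```

The point of the encoding is visible in the fast path: whenever two codes differ, the comparison finishes without touching any key columns, and the loser's code carries over unchanged to the next round of the merge.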
Sort-based grouping and aggregation
Database query processing requires algorithms for duplicate removal, grouping, and aggregation. Three algorithms exist: in-stream aggregation is most efficient by far but requires sorted input; sort-based aggregation relies on external merge sort; and hash aggregation relies on an in-memory hash table plus hash partitioning to temporary storage. Cost-based query optimization chooses which algorithm to use based on several factors including input and output sizes, the sort order of the input, and the need for sorted output. For example, hash-based aggregation is ideal for small output (e.g., TPC-H Query 1), whereas sorting the entire input and aggregating after sorting are preferable when both aggregation input and output are large and the output needs to be sorted for a subsequent operation such as a merge join. Unfortunately, the size information required for a sound choice is often inaccurate or unavailable during query optimization, leading to sub-optimal algorithm choices. To address this challenge, this paper introduces a new algorithm for sort-based duplicate removal, grouping, and aggregation. The new algorithm always performs at least as well as both traditional hash-based and traditional sort-based algorithms. It can serve as a system's only aggregation algorithm for unsorted inputs, thus preventing erroneous algorithm choices. Furthermore, the new algorithm produces sorted output that can speed up subsequent operations. Google's F1 Query uses the new algorithm in production workloads that aggregate petabytes of data every day.
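For reference, the in-stream aggregation the abstract names as most efficient reduces to a single pass over key-sorted input with O(1) state per group. A minimal Python sketch of that textbook technique (not the paper's new hybrid algorithm, whose details are in the paper):

```python
from itertools import groupby

def in_stream_aggregate(sorted_rows, key_fn, agg_fn, init):
    """In-stream aggregation over input already sorted on the grouping
    key: a group is finished as soon as the key changes, so the operator
    keeps one accumulator at a time and emits sorted output."""
    for key, group in groupby(sorted_rows, key=key_fn):
        acc = init
        for row in group:
            acc = agg_fn(acc, row)
        yield key, acc

# Example: SUM(amount) GROUP BY account over a key-sorted stream.
rows = [("a", 5), ("a", 7), ("b", 1), ("c", 2), ("c", 3)]
print(list(in_stream_aggregate(rows,
                               key_fn=lambda r: r[0],
                               agg_fn=lambda acc, r: acc + r[1],
                               init=0)))
# [('a', 12), ('b', 1), ('c', 5)]
```

The sorted, incrementally emitted output is what makes this operator attractive downstream (e.g., feeding a merge join), which is also the property the paper's new algorithm preserves for unsorted inputs.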
Extensible Query Optimization and Parallel Execution in Volcano; CU-CS-548-91
Heap-Filter Merge Join: A New Algorithm for Joining Medium-Size Inputs; CU-CS-471-90
Volcano, an Extensible and Parallel Query Evaluation System; CU-CS-481-90
- …