Analytic queries enable sophisticated large-scale data analysis within many commercial, scientific and medical domains today. Data skew is a ubiquitous feature of these real-world domains. In a retail database, some products are typically much more popular than others. In a text database, word frequencies follow a Zipf distribution with a small number of very common words, and a long tail of infrequent words. In a geographic database, some regions have much higher populations (and data measurements) than others.
Introduction
In online analytic processing (OLAP) a user executes a collection of complex queries over large data sets, in order to understand the data at hand and to obtain actionable knowledge. With the increasing main-memory capacity of contemporary hardware, query execution can occur entirely in RAM. Analytical query workloads that are typically read-only need no disk access after the initial load. In response to this trend, several commercial and research database management systems have been designed (or re-designed) for memory-resident data [1] . Examples of recent systems include H-Store/VoltDB [2] , Hekaton [3] , HyPer [4] , IBM BLINK [5] , DB2 BLU [6] , SAP HANA [7] , Vectorwise [8] , Oracle TimesTen [9] , Mon-etDB [10] , HYRISE [11] , HIQUE [12] , LegoBase [13] , Peloton [14] , and Quickstep [15] . Most analytic database * This is an extended version of a paper appearing at 36th IEEE International Conference on Data Engineering (ICDE 2020) † Work partly supported by a gift from Oracle Corporation.
(a) Distribution of book titles (b) Access to cache lines (Q1) Figure 1 : An example of data skew in a real dataset.
systems use some variant of columnar storage, since only the columns needed to answer the query need to be read [16] . Skew is a common feature of many real-world domains. Power-law distributions apply to many types of data, including word-usage in text databases, protein interactions, internet routing node-degree, phone-call data, city populations, email contact-lists, surname frequencies, and paper citations [17] . Several other kinds of skewed distributions can also be found [18] . Figure 1(a) shows the cumulative distribution of book reviews in the publicly available Amazon reviews dataset, where books are ranked by the number of reviews. The distribution is likely to be a representative proxy of the actual sales data. As shown, the 100,000 most popular book titles (< 5%) account for roughly 50% of the entire data, and 75% of the sales concentrate on the top 500,000 titles.
Although skew has been recognized as common and important in empirical database deployments, standard benchmarks such as TPC-H and SSB have specified uniform distributions and avoided skew [19] . Even when benchmarks such as TPC-DS adopt skew, they are constrained by query comparability issues to make the data more uniform than it would otherwise be [20] . As a result, there is little research on the impact of skew on analytic query performance.
In the context of a data warehouse, skew is likely to affect query performance. Consider a large fact table corresponding to the sales of a bookseller. One of the columns of this fact table is the id of a book, represented as an integer foreign key bid into a dimension table of detailed book information. Skew as shown in Figure 1 (a) in the bid column would be expected due to different popularity of the books. The following example queries utilize bid in an important way. Query Q1 executes a foreign key join to read or materialize a column from the dimension table. The obtained prices can then be used to calculate revenues. To answer this query, the database system can scan the fact table, and look up the price column of table B using S.bid as an array offset. If the number of products is such that the price column is larger than the cache in size, and S.bids are uniformly distributed in the dimension domain, then this lookup could incur expensive cache misses due to unclustered memory acccesses, which can often be the dominant cost [21] . With skew in S.bid, the situation is somewhat better because the most frequently occurring items are likely to reside in the cache. Nevertheless, the cache is still underutilized because a single cache-resident cache line will typically hold a small number of popular items and many unpopular items. To process this selective filtering, the database system can preprocess the dimension table to determine which products meet the B.price<100 condition. It can then create a bitmap indexed by bid that can be consulted using S.bid as an offset. Q2 has similar cache miss issues to Q1, except that the data being consulted is one bit per bid rather than several bytes. As a result, the product cardinality thresholds for cache-residence will be larger under the same cache capacity.
--Q3: Aggregate grouped by book ids Select bid, count( * ) From Sales Group by bid For this query, the database system can create a table of counts, indexed by bid, and update the count for each corresponding fact table record in turn. The memory access pattern (and cache behavior) is similar to Q1, with the added observation that because cache lines are updated, cache-line replacement triggers some additional memory-write traffic. Query Q3a is similar to Q3 except that we are only interested in the counts of heavy hitters, which for this query means the bids with a count among the top 4,000 book titles. Because the cache footprint of the heavy hitters is much smaller than the entire bid domain, there are opportunities for further performance enhancement if the candidate heavy hitters can be identified in advance.
For these core database operations, skew-aware data management is not adequately considered in existing main memory database architectures. In this paper, we propose a novel index structure called a permutation index for reordering data items by their access frequency (Section 3). Under skewed data distribution, popular data items are concentrated into common cache lines using permutation indexes, leading to improved locality for query processing. By carefully organizing data at the cache line level, we can exploit the data skew for better performance, which has not be utilized by previous research.
The permutation index method is simple, yet very effective at improving the cache utilization. Figure 1(b) illustrates the cumulative frequency of cache line accesses for the example book dataset during query Q1. Under a randomized data layout, the most popular book titles are sparsely distributed, so the distribution of cache line accesses is smoothed out. Even with significant data skew, the most frequent 8,000 cache lines (about the size of L1 cache using 4-byte bids) only correspond to 22% of the fact table data (see the dashed line). With the permutation index, frequent data are concentrated and the percentage increases to 52%. In Section 3.3, we further analyze the improvement in cache hit rates.
Built on permutation indexes, we develop efficient database operators in the presence of skew. Permutation indexes provide frequency information about data items, allowing for threshold-based algorithms that execute different code paths for items with different degrees of skew. To take full advantage of modern architectures, our implementation makes use of single-instruction multiple-data (SIMD) instruction sets, multithreaded execution, and software prefetching (Section 4).
Finally, we present a detailed experimental evaluation of our techniques in Section 5. We conduct experiments on Intel Skylake and Xeon Phi processors, using both synthetic microbenchmarks and real data sets. Our results show that exploiting skew and reordering data can significantly improve performance, making queries using permutation indexes up to an order of magnitude faster.
Background

Data Skew
As introduced in Section 1, a wide variety of real-world phenomena approximately follow skewed distributions [17] , [18] . For database systems, skewed distribution of empirical data has been recoginized in previous studies as a serious challenge to efficient query processing, since databases have to carefully handle parallel query execution and avoid performance degradation due to the long tail effect [22] . Various partitioning algorithms and load balancing strategies have been studied to combat skew [23] , [24] . Different from these studies, we demonstrate that uneven distribution of data can in fact improve performance when the skewed accesses are exploited.
While the techniques described in this paper apply to any skewed distribution, for ease of exposition, we focus on Zipf distributions in our microbenchmark study because they are common and allow one to model the degree of skew with a single parameter z. In a Zipf distribution, the frequency of the data item having rank r is proportional to r −z . z = 0 corresponds to a uniform distribution, while many real-world skewed data sets can be modeled by Zipf distributions with z ≈ 1 [17] , [25] .
Data Representation
For physical storage, we assume a columnar format as is common in analytic databases. An individual column is represented as a dense array of integers. The compact integer representations can be used as data values and/or as foreign key offsets into dimension tables stored as a collection of column arrays. When the true data values are strings, products, or other types of non-numerical data, databases will typically use some form of dictionary coding to map the domain to a contiguous range of integers.
Architectural Issues
We assume a columnar format as is common in analytic databases. An individual column is represented as a dense array of integers. The compact integer representations can be used as data values and/or as foreign key offsets into dimension tables stored as a collection of column arrays. In order to perform well on such RAM-resident data, database management systems must be sensitive to modern computer architectures. We highlight two significant trends in modern architectures that influence implementation choices.
Single Instruction Multiple Data. Modern processors support single-instruction multiple-data (SIMD) instruction sets. These instructions process many data items at a time, enhancing the data-parallelism of algorithms that can be written in a SIMD fashion. Further, SIMD instructions convert control dependencies to data dependencies, helping to eliminate branch misprediction latencies [26] . Current mainstream CPUs (e.g., Skylake [27] ) and Xeon Phi processors (Knights Landing [28] ) now support the 512-bit extensions (AVX-512).
Cache and TLB Performance. Modern architectures provide multiple levels of data and instruction caches. These small but fast memories improve performance when algorithms display spatial or temporal locality. Conversely, algorithms that ignore the caches (e.g., by randomly accessing data structures whose footprint exceeds the cache size) incur many cache misses and/or TLB misses that can reduce performance by an order of magnitude. Addressing cache and TLB performance is a necessity for a database management system given that most data structures for representing the underlying data will be much larger than the data caches. When access patterns are deterministic, prefetching can hide some of the memory access latency. For common patterns such as sequential access, the hardware can automatically prefetch data that is soon needed.
In some cases, these architectural features interact. For example, SIMD gather and scatter instructions are used to access multiple memory locations in a single instruction.
Such instructions are highly efficient on L1-cache-resident data, but perform no better than scalar code on data outside the L1 cache [29] , [30] . This kind of observation argues for synergy. SIMD optimizations enhance the potential of data locality to improve performance, and cache locality optimizations stand to benefit from additional SIMD enhancement.
Cache Performance Optimizations
For the example queries of Section 1, some existing techniques could be used to improve performance by optimizing the cache behavior. One approach is to measure the footprint of the accessed data column to see whether it is larger than the performance-critical cache. If so, a range-partitioning pass over the input bid references (and their fact table payloads for Q1 and Q2) can redistribute the data into fragments. With a sufficient partitioning factor, each fragment will reference a subset of bids that fits into the cache. As long as the partitioning pass is done efficiently with mostly-sequential data accesses and few cache misses [31] , the overhead of partitioning may be smaller than the gain from avoiding cache misses. On the other hand, multiple passes through the very large sales table would be needed, meaning that the overhead is nontrivial even if there is a performance improvement.
An alternative approach is to use software prefetching to overlap the latencies of multiple cache misses [29] . A prefetching distance d is determined empirically, and the bid value for the record d steps ahead of the current record is prefetched into the cache using machine-specific prefetch instructions. The hope is that the cache miss latency is paid while the processor is doing useful work on other items. Modern CPUs can have as many as 10 outstanding memory requests per core, allowing many prefetches to be in flight at once.
While prefetching helps by overlapping latencies, it does not completely eliminate cache miss effects in workloads that are memory-bandwidth bound. In the queries of Section 1, there is limited work that can be done while waiting for the miss to resolve. Therefore, there remains an opportunity for methods that reduce the total volume of data that needs to be brought into the cache from slow memory.
Permutation Indexes
To address the cache utilization problem as outlined in Section 1, we propose a novel index structure that we call a permutation index. We start by identifying a fact table column C with an integer data type. For simplicity we assume that the integer column is an offset into a dimension table as in the example queries. Let |C| denote the number of distinct values appearing in column C.
The permutation index is an array of size |C| containing the values appearing in column C in decreasing frequency order. The left part of Figure 2 show how these integers may be interpreted as offsets into a dimension table (starting from 1). Figure 2 (b) shows an equivalent representation of the same information in which the popularity rank of each distinct data item in C is stored as an additional column of the dimension table. If there are dimension table offsets that do not appear in C, then the representation of Figure 2 (a) will be slightly more compact than Figure 2 (b) because no information needs to be stored for the missing values. Note that the array representation of a permutation index using |C| log |C| bits is already within O(|C|) bits of the information theoretic bound of log(|C|!) [32] . To support efficient inverse operations, we can store a permutation index using compressed data structures for permutations, so that both lookups and reverse-lookups take sublinear time [32] , [33] . Alternatively, we can store both the offset and rank representations, so that lookups in either direction take only constant time.
Index Building and Maintenance
To achieve a better access pattern, we must recode the references in column C into a new version with the permuted references. For the data of Figure 2 , references to dimension table row i are replaced by references to row Rank(i) according to the rank representation in Figure 2 (b) to obtain Figure 2 (d). We assume that this replacement in the fact table has occurred prior to the query execution, typically when the database administrator (or automated physical design optimizer) decided to build the permutation index. Replacement of one integer value with another yields no net change in space for column C. The space cost of building the permutation index is one integer value per dimension table row, and is independent of the size of column C of the (presumably large) referencing column in the fact table.
Building the permutation index can be implemented as an aggregation query (Q3) to obtain the frequency of distinct dimension values, taking linear time, followed by sorting the values by their frequencies. The sorting step takes superlinear time but it executes over the much smaller dimension domain. Then, the identifier replacement in a fact table can be implemented as a materialization query (Q1) using the permutation index we just built, again taking linear time. After this transformation, the old column C (Figure 2 (c)) is dropped, and replaced with a new column C with the new identifiers ( Figure 2(d) ).
Updates to the permutation index can be handled without major reorganization by simply ignoring the effect of fact table insertions/deletions on the frequencies of existing domain values. When a new dimension table row is inserted, it is appended to the end of the dimension table, and its identifier is appended to the end of the permutation index as the new least-frequent item. A small number of updates is unlikely to dramatically change the relative ordering of popular value frequencies from a large fact table, so these simple choices will preserve most of the performance advantages of the structure even after a few updates. After many updates (e.g., a batch update in a data warehouse) the index should be rebuilt.
Overview of Query Processing
The permutation index functions as a reordering template for preprocessing the dimension table during the initial phases of query processing. We create a copy of the needed dimension table columns for a query, permuted according to the corresponding permutation index. For Figure 2 (a), assuming we needed the first column, we would create an intermediate result in the order E,A,H,D,C,G,F,B. Because the cost of the query is likely to be dominated by the large fact table containing the source column C, preprocessing the dimension table in this way will be relatively fast. Note that a dimension table may have several permutation indexes that refer to it from different source columns, each with different orders. If the dimension table is referenced just from a single source, then its reordering can be done entirely ahead of query processing.
Having reordered the data, popular items are adjacent and therefore are likely to share cache lines with other popular items. Therefore, during query processing if we access the data via offsets in the reordered table, we will get much better utilization of all cache levels, particularly when there is skew in an otherwise large domain. As we shall discuss in detail in Section 4, several basic database operations can make use of a permutation index and the reordered data to improve efficiency (in different ways). Since a permutation index simply replaces a fact table column and adds a copy of the dimension column, there is minimal impact on other database queries.
Using permutation indexes over skewed data essentially reorders data to change the memory access pattern during query execution. Physically reordering the dimension table by access frequency would eliminate the need for an explicit permutation index. However, we can impose only one ordering on the dimension table and there may be many competing demands (e.g., ordering by a domain value to support lexicographic comparisions via identifiers, or additional references from fact columns having different skew properties). A similar observation holds for the fact table, where reordering rows might help locality in the ordered columns, but there can be only one order unless data is stored redundantly. In light of these observations, one can think of permutation indexes as a way to influence physical database design. For some skewed data columns, permutation indexes eliminate the need for physical ordering of the fact table by those columns in order to get good cache performance. Physical organization can then focus on the remaining columns that may have more serious nonlocality.
Cache Behavior Analysis
To decide whether to build a permutation index or not, we need to estimate the improvement in cache utilization. A sophisticated query optimizer also requires knowledge of the cache behavior to compute the cost of query execution, together with hardware characteristics. For these purposes, we now study a model for estimating the cache hit rates under different data layouts.
Prior work has developed analytical cache models based on stack distances [34] , [35] . The stack distance is the number of distinct cache lines referred between two references to the same line. For an LRU cache with S lines, accesses with stack distance less than S will be cache hits, while others are misses. To apply this model in our setting, we need an estimation of access frequency to the cache lines, and the stack distance distribution for each line.
Conventional databases store statistics about column distributions to help estimate the selectivities of query conditions. It is also possible to sample the data and fit the samples to known distributions [36] . Most power-law distributions can be modeled with a few parameters, such as the slope and intercept in a log-log plot. Given the value frequency distribution, we can then map it to the cache line distribution for a particular layout. For example, using permutation indexes over 4-byte values, the access frequency of the most frequent 64-byte cache line would be the sum of the 16 largest frequencies in the value distribution. For a randomized mapping from value distribution to cache line distribution, the computed frequency would be an approximation.
Let f i denote the access frequency to cache line L i , where 0 ≤ f i ≤ 1 and i f i = 1. For line L i , we model its next reference as a geometric distribution with probability p = f i . Suppose after k trials, we see cache line L i . According to the stack distance model, if there are fewer than S distinct lines occurred within the k trials, then the reference will be a cache hit. Due to data skew (espectially the very frequent lines), there are potentially repeated occurrences of the same cache lines within the k trials. For every line L j where j = i, its expected number of occurrences is
If n j > 1, then there are (n j − 1) repeated occurrences of L j . Thus, an estimation of the number of distinct lines is
As k increases, d also increases (at a lower rate). For cache line L i , we want to find a threshold K so that after K trials, we see L i and there are estimated d = S distinct lines L j =i seen within the K trials. Then we know for any k < K, we have d < S so the access will be a cache hit. To compute the threshold K, a binary search can be used where K is at least S. Modeled as a geometric distribution, the cumulative distribution (CDF) of (k < K)
Empirically we find this model is accurate with errors less than 5% on our microbenchmarks given that we have a good estimation of the cache line frequencies. In practice, we can also sample the cache line distribution directly to verify the accuracy of the model and adjust accordingly if there is partial clustering, which we plan to address as future work.
Skew-Aware Operator Implementation
We now discuss how to implement efficient database operators that can take advantage of permutation indexes and the reordered data. As introduced in Section 1, we focus on three types of basic database operations: materialization (Q1), selection (Q2), and aggregation (Q3 and Q3a).
The benefits of the proposed permutation index approach are twofold. First, reordering data using permutation indexes improves cache utilization during execution of important database operations (Section 3). Second, the transformed identifiers in the fact table (Figure 2(d) ) provide valuable information about data frequency, which the operators can exploit to perform threshold-based processing (Section 4.2). In this way, the operators are "aware" of the degree of skew, and are able to take appropriate code paths for different actions.
We find that even a straightforward scalar implementation of these operators can take advantage of the improved cache utilization. To further enhance performance, we study the use of a variety of techniques including SIMD vectorization (Section 4.1), multithreaded execution (Section 4.3), and software prefetching (Section 5.1).
Data-Parallel Execution
Data-parallel execution using SIMD instructions has been successful at speeding up various database operations [26] , [30] , [37] , [38] , [39] , especially when the data is cache resident. For data-parallel read and write, we use AVX-512 gather and scatter instructions extensively. For simplicity of presentation, we assume a database column is simply an array of 32-bit integers. Most AVX-512 instructions provide variants for other data types including 8, 16, and 64-bit integers as well.
Materialization. The operator uses the fact table values as indexes to perform gather instructions from the dimension data array. Given 32-bit integer values, an AVX-512 gather instruction retrieves 16 dimension values at once. The latency of this gather operation depends directly on the number of cache misses occurred during its execution (Section 4.2.1).
Selection. This operation produces an array of qualifying fact table offsets. After preprocessing the dimension table using the selection condition to obtain a bitmap, the operator checks each row of the fact table against the bitmap using the referencing identifier as the index, writing out the row offset in the fact table if the bitmap testing succeeds (the scalar implementation needs to be branch-free to avoid the branch misprediction penalty). Whenever the operator needs to test a random bit, reads from the bitmap incur random memory accesses, similar to materialization. For SIMD, we use gather and shift instructions to compute addresses in the bitmap and to extract the bits into a mask. Using the AVX-512 compressed store instruction vpcompressd, we can contiguously store the selected fact table offsets (those with their respective bits set in the mask) into an output array.
Aggregation. This operation generates an array of numeric types to compute an aggregate of some fact table column, grouped by the dimension table offset. The implementation basically scatters into an output array after gathering old aggregates and peforming arithmetic computations. In a scalar implementation, the memory access pattern is similar to materialization with additional writes.
SIMD aggregation, however, needs an additional step to check for conflicts before scattering, since different SIMD lanes may write to the same memory location. When internal conflicts occur, we identify a subset of conflict-free SIMD lanes using an AVX-512 conflict detection instruction vpconflictd, and allow only writes from this subset to succeed using the masked scatter instruction. The other SIMD lanes are retained for processing during the next iteration, alongside new data. If there are many conflicts (e.g., when fact table data is highly skewed), then performance deteriorates severely due to this conflict resolution step. In the worst case where all SIMD lanes attempt to update the same aggregate value, only one (instead of 16 assuming 4-byte integers) can proceed. We shall discuss this problem further in Section 4.2.2. Note that these SIMD implementations do not rely on permutation indexes. Data-parallel optimizations and cache locality optimizations are orthogonal, but they work together to further boost performance.
Threshold-Based Processing
An important insight in the use of permutation indexes is that the transformed identifier in the fact table (see Figure 2(d) ) can be used as a proxy for the value frequency in the table. Given an estimate of the function mapping the numeric identifier to the value frequency, the system can estimate the likely cache behavior at each level using the model in Section 3.3. In general, the most frequently accessed data are likely cached under common cache replacement policies. Given the available cache capacity, it is possible to derive a threshold t such that accesses to data items with identifiers smaller than t are likely cache hits, while identifiers larger than t lead to misses. We can therefore choose different code paths for items based on their anticipated cache residence using a simple comparison between the identifiers and the threshold t.
Materialization.
One situation where such an optimization pays off is when most, but not all data references are cache hits. Consider a materialization query like Q1 in which roughly one in 16 accesses is a cache miss, and the other 15 accesses are cache hits. A profile of the performance of materialization, with data constructed to achieve specific cache miss rates, is shown in Figure 3 . For this profiling, we use 128 million dimension keys and 1 billion fact table tuples (the default microbenchmark setting in Section 5.1). Using single-thread implementations of both scalar and SIMD versions described in Section 4.1, we measured the elapsed time of the query execution on a Skylake processor and a Xeon Phi processor.
On both platforms, there is a big jump from 0 misses to 1 miss per 16 references, and then a less steep increase with additional misses. When there are no misses, the code can run at full capacity with essentially no memory traffic. At one miss, there is a stall on average once for each SIMD instruction. Additional misses have a less dramatic impact because the system can have multiple outstanding misses at any given time. Figure 3 is that the SIMD version has about 2x better performance over the scalar version when data is cache resident for the Phi. When there are many cache misses and the performance is bounded by the memory bandwidth, scalar and SIMD implementations have similar performances. In other words, SIMD optimization is most effective for accessing cache-resident data.
Another observation from
Given the apparent performance difference between inand out-of-cache accesses, it would therefore be advantageous to split a data stream with a low cache miss rate into two pieces: The first (and largest) fragment always hits the cache and runs at cache speeds. The smaller fragment contains the likely misses, but when misses occur they occur together and their latencies can be overlapped.
The implementation for such a splitting method can be done in a data-parallel way. Based on an estimated threshold, a SIMD comparison operator determines which items are likely hits, and their references are resolved immediately. We use masked gather instructions in AVX-512 to directly write materialized data into the output array. For items failing the comparison, the item and its address are written to the tail of a buffer. A second pass then iterates through the buffer to resolve the remaining references. In Section 5.1, we demonstrate that for a materialization operation, the splitting method performs better than the baseline when there are one or two cache misses per 16 accesses.
4.2.2.
Aggregation. The frequency information implicit in transformed identifiers can also be used to improve the performance of aggregation. As we discussed in Section 4.1, SIMD implementations of aggregation perform poorly when the degree of skew is high in data. The reason for this performance degradation is that the most popular data items become so common that conflicts within the SIMD scatter step are frequent. In other words, two or more different SIMD lanes try to update the aggregate value (e.g., count) for the same group. Conflicts take several steps to resolve, and during this resolution process additional conflicting values can be read into the SIMD register, exacerbating the problem.
To mitigate this behavior, one can avoid conflicts on common items by re-mapping accesses to the most frequent items to distinct copies, one per SIMD lane. Identifying whether an item is among the most common is simple: it has an identifier that is no more than some threshold t. Because of data skew, remapping just a few of the most frequent items can already be quite effective in reducing conflicts. For example, in our microbenchmark experiments (Section 5.1), remapping the 40 most common 4-byte values was a good choice. In the modified aggregation algorithm, we keep 16 copies of each of the top 40 values. Accesses to these common values never conflict because the copy used is determined by the SIMD lane. The remaining items have only one copy, as before. The copies can be stored immediately before the original data array, preserving data contiguity. For SIMD lane i (where i = 0 . . . 15), if the identifier loaded for this lane is not greater than t, then the copy at offset (−i * t) is updated.
Using SIMD instructions, the remapping can be simply implemented as masked arithmetic computations based on the comparison with the threshold t. The top-t locations in the original array now store only the aggregates from a single SIMD lane. When t = 40, the cache footprint is bigger by 40 * 15 * 4 = 2400 bytes, which is small relative to typical cache sizes and thus of minor impact. In the end, combining the copies for all SIMD lanes produces the final aggregates. In our experimental evaluations, we find this copying method effectively reduces conflicts in case of high skew, drastically improving performance.
Heavy hitter aggregation. The permutation index has direct benefits for queries like Q3a that compute counts for just the most frequent elements (i.e., heavy hitters). The most common 4,000 values for Q3a are simply the transformed identifiers from 1 to 4,000, assuming that the permutation index is up-to-date. We can directly aggregate these values and ignore the rest (using the limit value as a cut-off threshold), unlike conventional methods that would need to compute the exact counts for all bid values. The cache footprint of this approach is much smaller. For 4byte integers, the output array is entirely L1-cache resident. Since there are almost no out-of-cache memory accesses, the performance of heavy hitter aggregation is significantly better than a full aggregation.
Intra-Query Parallelization
A modern CPU contains many cores, each capable of independent work. It is therefore essential that this available parallelism is exploited by implementing multi-threaded versions of query processing algorithms. Fortunately, materialization (Q1) and selection (Q2) operations are relatively easy to parallelize. We can partition the input fact data into non-overlapping chunks so that different threads can work on different chunks independently. Shared reads (e.g., from a common dimension array) typically do not cause performance issue since they are cached across cores.
Aggregation (Q3) is trickier because independent threads may try to update the count for the same bid. Similar to the SIMD conflict detection, implementations must guarantee the atomicity of potentially conflicting updates to the shared result array, which may induce a performance overhead. Data skew potentially exacerbates the problem by increasing the probability that conflicting updates occur.
On the other hand, to fully avoid conflicts and the overhead, each thread can have its own independent copy of the result array, so that threads do not concurrently update the same memory location. The obvious disadvantage is memory consumption, which is multiplied by the number of threads used. Another cost is the overhead of the final aggregation that combines partial results from all threads, which gradually becomes non-negligible with more threads used.
In case of data skew, we are able to get the best of both approaches. We use a hybrid approach to allow different threads access to their own private versions of the hot data, while using atomic operations on shared representations of the less frequent data. Such an approach reduces conflicts while remaining space-efficient, similar to recent approaches to parallel aggregation [40] , [41] , [42] . Given the available memory space and the frequency estimator, it is possible to derive (or obtain experimentally) a threshold for distinguishing the frequent data based on the transformed identifiers.
As we demonstrate in our experiments, a small threshold is often enough to speed up queries in the case of skewed data. For example, typically each thread can use a private buffer for the most frequent 8, 192 data items so that the buffer fits in the L1 cache, using four-byte identifiers. A limitation of the current SIMD extensions is that they lack support for vector atomic instructions. Introduction of such instructions could significantly improve performance [43] .
Experiments
We conducted experiments on an Intel Skylake CPU and on an Intel Xeon Phi (Knights Landing) processor, both of which support AVX-512. The Xeon Phi does not have a L3 cache, and its 1MB L2 cache is shared between two adjacent cores. Table 1 summarizes the details of our experimental platforms. Both machines are configured to use 2MB hugepages to avoid the TLB thrashing problem discussed in Section 2.3. 1GB hugepages are also available and result in similar performance. Both hugepage options lead to much better performance than using the regular 4KB pages.
Our code was compiled using GCC 7.3 with -O3 optimization and loop-unrolling enabled, and ran on 64bit Linux operating systems. Performance counters such as cache misses and branch misses were obtained using the perf events interface. We employed thread pinning to avoid undesired thread migration, maximizing the utilization of private cache and local memory on multicore NUMA platforms.
Microbenchmark Results
We have implemented microbenchmarks to evaluate the potential performance improvements enabled by the permutation index. We simulate skew in empirical data by generating zipf distributions with varying zipf parameters. For a particular zipf factor, we vary the number of keys in a dimension table (i.e., the cardinality of the domain). We fix the column cardinality to 1 billion tuples in a fact table, and assign key values (4-byte integers) from the dimension domain to the fact table column under the chosen zipf distribution, in random order. We also experiment with varying sizes of the dimension domain.
For the baseline approach, the tuple identifiers in a dimension table are randomized (denoted as "rand" in the figures). For the permutation index approach, the dimension column is sorted by frequency and the referencing column in a fact table is preprocessed to hold transformed identifiers (denoted as "freq"). Using both scalar and SIMD (AVX-512) implementations, we measure the elapsed time for database operations discussed in Section 4. For a particular operation, all tuples in a column are executed in a tight loop. We also evaluate the approaches both with and without software prefetching of the data references. Figure 4 shows the performance of materialization (Q1) using a SIMD implementation, which was faster than the scalar implementation for this query. Figure 4(a) shows that at z = 1.0, a common distributional parameter in practice, permutation indexes speed up the query by more than 20% on both architectures. With the number of keys fixed at 128 million, Figure 4 (b) shows that performance improvements occur over a range of z values. For smaller z values, the data is close to uniform and both methods suffer cache misses. For larger z values, the skew is so concentrated in a handful of data values that both methods enjoy cache hits most of the time. Without software prefetching, the speedup at z = 1 and 128 million keys is 1.3x on the Skylake and 1.8x on the Phi. Figure 4 (c) demonstrates that the time improvements on the Phi are indeed due to cache behavior. Half of the data loads are to the array of probes (almost always hits due to clustered sequential access and hardware prefetching), while half are to the accessed data in the dimension array. Thus, at small z values we see a 50% load hit rate in the L1 cache, and also a 50% miss rate for the L2 cache. The dimension table footprint is much bigger than the L2 cache, making L2 hits that are L1 misses insignificant at small z values. As z increases, the L1 hit rate of the permutation index method is better (higher) than the randomized method, while the L2 miss rate corresponding to the access frequency from RAM is also better (lower). In addition, the permutation index method has much better data TLB utilization at 128 million keys. For example, at z = 1.0, the DLTB miss rate is 2.6% using permutation indexes, while the miss rate of the baseline method is 20%.
As discussed in Section 2.4, it is often possible to utilize out-of-order execution and software prefetching to hide the latency of cache misses. Figures 4(a) and 4(b) also show that the benefits of permutation indexes are still present even when prefetching is used. For example, at z = 1.0 and 128 million keys, using permutation indexes on top of prefetching makes the query 3x faster on the Phi. For these results, we chose the prefetch distances empirically. Figure 4(d) shows performance with varying prefetch distances. Since Skylake does not support AVX-512 PF extensions, we simulate the vpgatherpf instruction with a loop of 16 scalar prefetch instructions. As shown in the figure, the best prefetch distance on the Phi is 3(x16) for the baseline and 6(x16) for the permutation index method. As we prefetch more, the baseline approach quickly suffers due to cache thrashing, while the permutation index approach is almost unaffected. In general, we find that permutation indexes enable longer prefetch distances comparing with the baseline, making our approach even faster. On the Skylake, it turns out that the prefetch overhead is more than the benefits from prefetching, so we did not show its prefetch performance in Figures 4(a) and 4(b) .
In the case of high skew, Figure 4 (e) presents a zoomedin view of the SIMD performance (the red and green curves) from Figure 3 for 0-6 cache misses, together with the performance of the threshold-based splitting methods described in Section 4.2.1. Without splitting, there is a big difference of 2.1-3.1x from no cache misses at all to just a single miss, while subsequent misses cause much less latency. When the number of cache misses is small (1 or 2 per 16 elements), the threshold-based implementation splitting in-and out-ofcache accesses is better than the baseline. With more misses, the benefits of this optimization are gradually outweighed by the overhead of writing out intermediate buffers. Figure 4 (f) shows the performance of multithreaded materialization. We vary the number of threads from 1 to 256 on the Phi, and to 48 on the Skylake. Since the performance scales almost linearly up to 16 threads, we omitted the results with fewer than 8 threads used. Even though threads are sharing cache resources, the permutation index method always performs better, achieving a speedup of over 1.3x on both platforms with varying threads. On the Phi, using more than 64 threads does not provide any more benefits than using the baseline approach, and prefetching makes the query even slower. In contrast, the permutation index approach still slightly improves performance. Figure 5 shows the performance of selection (Q2) using the SIMD implementation, which was again faster than the scalar implementation for this query. We omit the results without prefetching on the Phi since they are slower. The performance profile is similar to Figure 4 , except that for low key-counts the performance is better. Because only one bit (rather than 32 bits) is needed per key, a larger number of keys is required before the cache capacity is exceeded. The relative performance of the Skylake processor is better in Figure 5 than in Figure 4 because it has a large L3 cache that can hold the entire bitmap. Figure 5(c) shows the multithreaded performance. Comparing with the baseline, the permutation index method is 1.4-2.2x faster on the Phi, and 1.3-1.8x faster on the Skylake. Figure 6 shows the performance of aggregation (Q3) using the scalar and SIMD implementations. On Skylake, scalar code generally outperforms the SIMD code because the SIMD code has the overhead of conflict detection and resolution. In fact, for high z values, the impact of conflict resolution creates a severe degration in performance. The threshold-based copying method described in Section 4.2.2 addresses this issue. On Skylake, the scalar version with our permutation index optimization is fastest for all z values in Figure 6 (b). The modified SIMD algorithm with copying is the best-performing method at high skew levels on the Phi (Figure 6(e) ), but not on the Skylake machine. Figure 6 also shows the performance of heavy hitter aggregation for computing the counts of the 4,000 most frequent data items ("top-4k"). Our optimization discussed in Section 4.2.2 significantly improves query performance, since the threshold-based approach only updates cacheresident data. In comparison, performing full aggregates without using the permutation index would have much larger memory footprint and have to do extra work to extract the top-4,000 results. As an example, for z = 0.5, the performance improvement over computing the entire aggregate is 8.6x for the Phi, and 6.6x for the Skylake processor.
Note that we report the performance results for all operations using branch-free implementations, with one exception: for the heavy hitter aggregation on the Phi, we use an explicit if-branch to check whether the identifiers are smaller than 4,000, since we find the scalar branchfree version is particularly slow on the Phi. As shown in Figure 6 (e), when there are no heavy hitters (z = 0.0), or when the data is very skewed (z = 2.0), branching does not hurt the performance of heavy hitter aggregation. At high skew, the latency is longer because of more updates. When z is close to 1.0, performance suffers due to branch misprediction penalty. For example, at z = 1.0 the branch misprediction rate is 45.5%. In contrast, Figure 6 6(c) and 6(f) present the aggregation performance using all 48 threads on the Skylake, and 256 threads on the Phi, respectively. Single-threaded results are shown as a reference. We compare the performance of independentbuffer implementations, shared buffer implementations using atomic operations, and the threshold-based hybrid ap-proach described in Section 4.3. For independent-buffer implementations, the permutation index method is better than the baseline, but both methods do not scale well because of the increased contention and the overhead of final combination when all threads are used. On the Phi, they are the slowest methods when all threads are used, despite using much more memory. For the shared-buffer implementations, the permutation index method performs worse with multithreading because it leads to more severe contention over popular cache lines for atomic operations. The hybrid method addresses the contention problem. As shown in the figure, the hybrid methods perform the best. Using 256K as the threshold results in improved latency comparing to the 8K threshold (these threshold values correspond to L1 and L2 cache sizes), but uses more memory. Comparing the independent-buffer baseline approach, this method is 4.7x faster on the Skylake and 10.4x faster on the Phi, with much smaller memory usage.
To estimate the cost of building a permutation index, one can add the cost of an aggregation query (Q3) and a materialization query (Q1) using the baseline (rand) organization. In every configuration described above, building the index using all available threads would take just a few seconds. 
Results on Real Datasets
We tested the permutation index method on three types of real-world datasets with different sizes and degrees of skew:
• Pageview 1 is a dataset of web request logs of wikipedia pages. The data is cleaned to filter out requests from search engine spiders, leaving only human traffic. In a database system, the request logs are stored as a fact table referencing a dimension table describing page information. For our experiment, we used the monthly request data from December 2018.
• Product 2 is the Amazon reviews data. We believe the number of public product reviews is a proxy of the sales data, and it also exhibits skew. We consider the products as dimension data, and each review is a data item in the fact table. The data is separated into different categories, and each category exhibits different skewed data distribution. Our experiments uses eight categories with the most products.
• Graph 3 . We tested several large graphs from the Stanford Large Network Dataset Collection [44] , including social, communication, citation, and road networks. In our experiments, edges are stored as a fact table, with nodes stored in a dimension table.
As an example, in a social network, nodes represent users and the dimension table describes user information. An aggregation query on the fact table, for instance, is to count the number of friends for every user (i.e., node degrees). Table 2 summarizes the sizes of the all the real-world datasets used in our experiments.
In the following experiments, we report the results of SIMD implementations for materialization and selection, and scalar implementations for aggregation (and heavy hitter aggregation). These implementations are generally fast as revealed in our microbenchmark analysis. For heavy hitter aggregation, we again compute the counts for the top 4,000 most frequent data items. Figure 7 shows the performance improvements on the largest dataset Pageview, using a varying number of threads. On the Skylake with a single thread, the permutation index speedup of materialization, selection, full aggregation, and heavy hitter aggregation is 1.5x, 2.2x, 1.3x, and 4.0x, respectively. The results on the Phi are generally slower than on the Skylake, but using permutation indexes similarly achieves 1.10-2.27x performance using a single thread. With more threads, permutation indexes speed up queries similarly for materialization, selection, and full aggregation, up to 2x. When all threads are used, the performance ratio for heavy hitter aggregation is 20.4x on the Skylake and 22.3x on the Phi. Because all threads can use their cache-resident private buffer to execute heavy hitter aggregation in parallel, the permutation index method is particularly effective. Tables 3 and 4 present the single-threaded latency results (in milliseconds) and performance speedups of using permutation indexes on all other datasets. On the Skylake for the product review and graph data, the performance improvements are up to 6.0x for heavy hitter aggregations, and up to 1.4-1.5x for other operations. Results on the Phi are generally slower, but they achieve similar speedups using permutation indexes.
Related Work
Skew has been used to improve OLAP query execution in several ways that are orthogonal to those described in this paper. High-frequency data items can be represented with fewer bits to reduce data transfer costs [45] , [46] . When skew causes load imbalances in a parallel join, explicit scheduling of common keys can overcome this imbalance [23] , [47] , [48] . At lower levels of the memory hierarchy, popular rows may be surrounded by cold rows in a disk page. This observation has led some systems to adopt row caches that keep popular rows in RAM even if their page has been evicted from the buffer pool [49] . Such a solution is not viable at higher levels of the memory hierarchy because the programmer has little direct control of what data is loaded into the processor caches.
Techniques for tracing data access streams to quantify the degree of latent locality have been described [50] . Distributional parameters of skewed data streams can be estimated [51] . Skew can affect the performance of selectivity estimators in query optimization, and skew-aware methods are required for accurate estimation [52] .
The advent of large main memory capacity paved the way for high-performance OLAP query execution. Columnoriented execution [53] and cache-conscious operators [54] were proposed before the advent of multi-core CPUs. Analytical database systems adopted column-oriented storage, while focusing on compression [6] , [38] , [55] and complex materialization strategies [56] , [57] to further optimize memory access. Block-at-a-time execution [58] and code generation [59] , [60] , [61] , [62] are both state-of-the-art designs for analytical query engines [63] . SIMD optimizations have been applied to isolated operators using key-rid pairs. Many join implementations exist [21] , [37] , [64] , [65] , [66] , including for many-core platforms [67] . SIMD implementations of stand-alone operators such as sorting [31] , [39] , [68] , [69] are also common. Linear-access operators such as scans [26] and compression [38] , [70] , [71] , are inherently data-parallel. Advanced SIMD optimizations [30] , [72] include non-linear-access operators. A recently proposed vector algebra [73] proposes templates for both SIMD CPUs and SIMT GPUs. Nonlinear-access operators are wrappers to vector gathers, such as joins executed by dereferencing join indexes, which are likely to benefit from the permutation indexes when there is skew in the join attribute distribution.
Other architecture-specific optimizations such as software prefetching and query compilation are used to improve database performance [29] . Cache partitioning is used to address the cache pollution problem in shared cache to accelerate concurrent workloads [74] . The performance of frequent pattern mining algorithms can also be improved by tuning data layout and access patterns [75] .
Conclusions
We propose permutation indexes to reorder data so that popular data items are concentrated in the cache hierarchy. Using this method, we can exploit the data skew inherent in many practical domains to improve cache utilization at all levels. Efficient database operators can be developed using permutation indexes, combining the benefits of cache optimization with SIMD vectorization. Our method is also effective when working together with multithreaded execution and software prefetching, further speeding up query execution. Through extensive experiments with real and synthetic data, we demonstrate that the performance of materialization, selection, and aggregation queries can be significantly improved.
