Index data structures are an important component of database applications. Our measurements show that for a memory resident CSB+-tree (Cache Sensitive B+ tree), the cache miss penalty now accounts for over 50% of the total running time. This is exacerbated by current trends. As a result of Moore's law, both the CPU speed and the DRAM density have doubled every 18 months. However the cache miss penalty for the CPU has increased with today's longer cache lines.
Introduction
Processor speed has increased exponentially in recent years. However the memory speed has increased only moderately. Hence, the memory access time has become the significant bottleneck for many com-
Our Approach: Cooperative CPU caching
This paper introduces and evaluates cooperative caching over the L2 caches of a cluster for queries over memory-resident index structures. This new solution outperforms two existing solutions. Superlinear speedup is achieved by eliminating unnecessary accesses to main memory. The elements of a given sorted array are evenly partitioned across a set of cluster nodes. The size of each partition is equal to or smaller than the size of the L2 cache of a node in the cluster. Each node in the cluster that contains part of the sorted array is called a slave. One node in the cluster acts as the master and distributes the keys in the stream of arriving queries to the appropriate slaves. On the slave side, three alternative strategies to query over a cache-resident sorted array are proposed and evaluated.
The contributions of our work are:
1. This is the first proposal of cooperative CPU caching in a cluster. Section 3 defines cooperative CPU caching and its relation to the more traditional cooperative caching. Section 4 describes our algorithm for cooperative CPU caching. In Section 6, on our cluster testbed, we demonstrate that accessing a remote node's L2 cache is faster than accessing local memory when using today's high-speed networks. The key to conquering the memory wall with cooperative L2 caching is replacing several random local memory accesses with one remote L2 cache access. One also amortizes the network delay by batching many queries in a message of size at least 12 KB.
Using queries over a memory-resident sorted array as a demonstration, the new query method with cooperative CPU caching achieves superlinear speedup in a Linux cluster.
2. In Section 5, a more accurate cache performance model is given for our case of interest: lookup in main memory B+ trees or n-ary trees. We explore the problem further, by concentrating on the modeling of a main memory B+ tree or n-ary tree, in which the cache is small compared to the tree size. We consider both capacity cache misses and compulsory cache misses for an exponential distribution of access rates for nodes in a B+ tree. A comparison of the model with experimental results is given in Section 6.1. This model can be generalized to any tree traversal problem in which the access rates to the nodes in the tree are exponentially distributed. A comparison with other researchers is provided in Section 1.2.
3. Three alternative strategies for cooperative CPU caching are proposed and evaluated in Section 5.2.3. They are experimentally tested in Section 6. This is the first evaluation of query strategies over a cache-resident sorted array. The experimental results show that binary search over a cache-resident sorted array is faster than indexing strategies. Even when both the sorted array and the index fit in the L2 cache, binary search is still superior. When the data fits in the L2 cache, CPU computation dominates the cost. Hence, the strategy with the least CPU computation performs best for cache-resident applications.
4. In Section 6.3, our new indexing method with cooperative CPU caching is evaluated and compared with other existing indexing methods both theoretically and experimentally. Our new indexing method provides both a faster response time and a larger throughput. In most scenarios superlinear speedup is achieved, because cooperative CPU caching avoids most of the L2 cache misses and TLB misses produced by other indexing methods.
Related Work
Several publications address memory-resident B+ trees [3, 9, 10, 11, 15, 16, 18, 19]. The B+ tree was first proposed to optimize disk performance, with the node size equal to the disk page size. More recently, researchers [9, 10, 11] have found that for a memory-resident B+ tree, the node size should be one or a few cache lines to improve cache line utilization. With the smaller node size, the branching factor becomes smaller and the height of the B+ tree becomes larger. Rao [18] proposed the CSB+ tree (cache sensitive B+ tree). In a CSB+ tree, the branching factor is improved by storing only the first child pointer at each node. Other child pointers can be calculated by adding an offset to the first child pointer, because all child nodes are stored consecutively in memory in a CSB+ tree. Zhou [19] proposed a buffering access technique to improve the cache performance of bulk lookups. However, the cache miss penalty still accounts for over 30% of the total cost of each query in all of the above methods. Furthermore, none of them explores how to access an index in a cluster environment to provide both higher throughput and faster response time.
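The child-addressing idea behind the CSB+ tree can be illustrated with a minimal sketch. It assumes a flat node array in which all children of a node occupy consecutive slots, so that each node stores a single `first_child` index. The names `Node`, `first_child`, `child_index`, and `lookup` are illustrative assumptions and are not taken from [18].

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    keys: List[int]             # separator keys (internal node) or stored keys (leaf)
    first_child: Optional[int]  # slot of the first child in the node array; None for a leaf


def child_index(node: Node, key: int) -> int:
    """Compute the child's slot as first_child + offset instead of reading a per-child pointer."""
    offset = bisect_right(node.keys, key)
    return node.first_child + offset


def lookup(nodes: List[Node], key: int) -> bool:
    """Walk from the root (slot 0) down to a leaf and test whether the key is stored there."""
    node = nodes[0]
    while node.first_child is not None:
        node = nodes[child_index(node, key)]
    return key in node.keys


# Root with separators [10, 20]; its three children sit consecutively in slots 1..3.
nodes = [
    Node([10, 20], 1),
    Node([1, 5], None),
    Node([10, 15], None),
    Node([20, 30], None),
]
```

Because only one index is stored per node, the space saved can hold additional separator keys, which is the source of the larger branching factor described above.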
Several researchers have studied the cache performance for tree traversal problems or B+ tree lookups. Ladner et al. [14] assume the nodes in the tree are accessed uniformly. This model is not accurate for the B+ tree lookup problem, because the access rates for the nodes in the B+ tree are exponentially distributed. Hankins and Patel [11] assume an exponential distribution for the access rates of nodes in a B+ tree. However, they only considered the compulsory cache misses, and not the capacity cache misses. They assume that the B+ tree can fit in the L2 cache.
Review of Memory Hierarchy
The gap between the main memory speed and the CPU speed is increasing. Caches are used between memory and CPU to alleviate the gap. Caches are small, fast static RAM memories. Caches typically run at the same clock rate as the CPU.
There can be several levels of cache between the CPU and the main memory. The level of a cache is defined according to its distance from the CPU. The closest is called the L1 cache, the second closest is called the L2 cache, and so on. The closer a cache is to the CPU, the faster and smaller it is. PCs typically have two levels of cache, and servers typically have three. The size of a cache ranges from 16 KB to 128 MB. At each level, a cache can be categorized as an instruction cache or a data cache. In our study, we consider only the data cache performance, not the instruction cache performance.
Because the cache contains recently accessed data, program performance improves if data are accessed with temporal locality. If the data to be accessed is in the cache (a cache hit), the processor receives the data from the cache instantly. If the data to be accessed is not in the cache (a cache miss), the processor is blocked until the cache line that contains the requested data is transferred from memory to cache. A user can't directly manage the caches. The hardware decides which cache lines to load from memory to cache. In modern architectures, with the aid of prefetch instructions or a software prefetch package, one can manage the cache with some limitations.
Cache misses fall into three cases:
1. Compulsory Cache Misses: cache misses that happen when data is accessed for the first time. Compulsory cache misses can be avoided with a larger cache block size or with prefetching.
2. Capacity Cache Misses: for a fully-associative cache, if data has been loaded from memory to cache and a later access to the same data again causes a cache miss, this is called a capacity cache miss. Exploiting a program's temporal and spatial locality can reduce capacity cache misses. Increasing the cache size can also reduce capacity cache misses.
3. Conflict Cache Misses: if a cache miss happens in an A-way set-associative cache but would not happen in a fully-associative cache, this is called a conflict cache miss. The cache is much smaller than memory, so many cache lines in main memory map to the same set of cache blocks. If A+1 cache lines to be accessed map to the same set of cache blocks, a conflict cache miss will happen.
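The distinction between conflict misses and the other two kinds can be made concrete with a small simulation. The sketch below assumes LRU replacement and a set index taken as the line address modulo the number of sets; both are simplifying assumptions, and `count_misses` is a hypothetical helper rather than a model of any particular processor.

```python
from collections import OrderedDict


def count_misses(line_addresses, num_sets, ways):
    """Simulate an LRU cache with `num_sets` sets of `ways` lines each; return the miss count."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for line in line_addresses:
        s = sets[line % num_sets]      # set index from the low-order part of the line address
        if line in s:
            s.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1
            if len(s) == ways:
                s.popitem(last=False)  # evict the least recently used line
            s[line] = True
    return misses


# Three lines (A + 1 with A = 2) that all map to set 0 of a 4-set, 2-way cache,
# accessed round-robin ten times.
trace = [0, 4, 8] * 10
set_assoc = count_misses(trace, num_sets=4, ways=2)    # 8 lines of capacity, 2-way
fully_assoc = count_misses(trace, num_sets=1, ways=8)  # same capacity, fully associative
conflict_misses = set_assoc - fully_assoc
```

In this trace the 2-way cache misses on all 30 accesses, while the fully-associative cache of the same capacity takes only the 3 compulsory misses, so the remaining 27 are conflict misses.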
Cooperative CPU Caching

The Definition of Cooperative Caching in the Literature
Cooperative caching was first proposed in [8] . In the literature, a cooperative cache is the layer in the storage hierarchy positioned between the local client memory and the local client disk. The cooperative cache shares the physical memory of all clients. Whenever a client lacks a block in its local memory, it attempts to utilize the block from the cooperative cache, the memory of the other clients. The cooperative cache is different from the other layers in the storage hierarchy because it is distributed over the clients.
With high-speed networks, access to a remote client's in-memory data is an order of magnitude faster than access to local disk data. This motivates the use of cooperative caching: coordinating the memories of many machines distributed on a network to form a more effective overall storage system.
The Definition of Cooperative CPU Caching in This Paper
We explore the use of cooperative caching one level higher in the memory hierarchy than is traditionally considered. In our paper, a cooperative CPU cache is the layer in the storage hierarchy positioned between the local client CPU cache and the local client memory. The cooperative CPU cache shares the CPU caches of all clients. Whenever a client lacks a block in its local CPU cache, it attempts to utilize the block from the cooperative CPU cache, that is, the CPU caches of the other clients.
We design a more effective index lookup strategy over the cooperative CPU cache in a cluster. The following technology trends stimulate us to cooperate CPU caches in a cluster:
1. The disparity between processor speed and memory speed is increasing. Processor performance is increasing much more rapidly than memory performance. This divergence makes it increasingly important to reduce the number of memory accesses, especially random memory accesses. Index lookup and tree traversal problems produce many random memory accesses.
2. Emerging high-speed, low-latency switched networks can transfer data across the network much faster than standard Ethernet. With an older network, the combined cost of an index lookup in a remote L2 cache plus the data transfer might exceed the cost of an index lookup in local memory, because of the large overhead due to network latency and the slower network speed. With today's high-speed, low-latency networks, the cost of data transfer over the network is low. Furthermore, in most of today's systems, communication can overlap with computation. This makes the communication cost negligible.
In general, cooperative caching allows clients to read data from the memory of peers instead of from local disk. In our paper, cooperative CPU caching allows clients to read data from the L2 cache of peers instead of from local memory. 
Design
To improve the search performance on a cluster, we propose cooperative CPU caching for index structures, as illustrated in Figure 1. We first decompose the sorted array into small partitions and distribute one partition to one node in the cluster. To improve cache efficiency, we require that the size of each partition be equal to or smaller than the size of the L2 cache. At each node, to assist query processing, we propose three alternative designs: binary search over one partition of the sorted array; query over a CSB+ tree index [18] with node size equal to the L2 cache line size; and query over a CSB+ tree index with buffering accesses [19].
One node in the cluster acts as the master, and all other nodes in the cluster that contain partitions of the sorted array act as slaves. The number of slave nodes needed is the ratio of the size of the sorted array to the size of one partition.
Each node has two processes. One is the communication process and the other is the computation process. Under the ideal conditions, when the communication process is sending and receiving messages, the computation process can concurrently work to process the stream of arriving or pending search keys. The master maintains n+1 keys (n is the number of slaves). The values of the n+1 keys are: the smallest value in the sorted array, the largest value in the first partition, the largest value in the second partition, and so on, until reaching the last partition. We call these keys the partition keys. Partition keys are used to dispatch a search key to an appropriate slave. We could have used only n-1 keys to partition among n slaves but two additional partition keys are used to test if a search key is in the scope.
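The partition keys and the comparison-based dispatch described above can be sketched as follows. This is a minimal illustration under the assumption that partitions are plain Python lists and that binary search over the partition keys is used; `make_partitions`, `partition_keys`, and `dispatch` are hypothetical names.

```python
from bisect import bisect_left
import math


def make_partitions(sorted_array, partition_len):
    """Split the sorted array into consecutive partitions of at most partition_len elements."""
    n = math.ceil(len(sorted_array) / partition_len)  # number of slaves needed
    return [sorted_array[i * partition_len:(i + 1) * partition_len] for i in range(n)]


def partition_keys(partitions):
    """Return n + 1 keys: the global minimum followed by the largest value of each partition."""
    return [partitions[0][0]] + [p[-1] for p in partitions]


def dispatch(pkeys, key):
    """Return the slave index for key, or None if key lies outside [min, max]."""
    if key < pkeys[0] or key > pkeys[-1]:
        return None  # the two extra partition keys provide this scope test
    # First partition whose largest value is >= key (search starts at index 1).
    return bisect_left(pkeys, key, 1) - 1


array = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
parts = make_partitions(array, 4)
pkeys = partition_keys(parts)  # [1, 7, 15, 19]
```

A key that falls between two stored values, such as 8, is still routed to the partition covering its range; the slave then reports that the key is absent.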
The two processes on the master side, the computation process and the communication process, act as the producer and the consumer respectively. The master keeps a queue for each slave to store the outgoing search keys for that slave. All search keys are sent to the master first. The computation process on the master dispatches each search key into the appropriate queue after comparing it with up to n+1 partition key values. The total number of keys in the queues serves as the timeout indicator: when the total number of keys in all the queues reaches a defined value, all keys in the queues are sent out to the appropriate slaves.
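The queueing behavior on the master can be sketched as below. The sketch is single-threaded and replaces the communication process with a `send` callback, which is an assumption made for illustration; `Master`, `flush_threshold`, and `submit` are likewise hypothetical names.

```python
from bisect import bisect_left


class Master:
    """Producer side: routes keys into per-slave queues and flushes them in batches."""

    def __init__(self, pkeys, flush_threshold, send):
        self.pkeys = pkeys                      # n + 1 partition keys, ascending
        self.queues = [[] for _ in range(len(pkeys) - 1)]
        self.pending = 0                        # total keys across all queues
        self.flush_threshold = flush_threshold
        self.send = send                        # callback: send(slave_id, list_of_keys)

    def submit(self, key):
        """Queue one search key; return False if the key is out of scope."""
        if key < self.pkeys[0] or key > self.pkeys[-1]:
            return False
        slave = bisect_left(self.pkeys, key, 1) - 1
        self.queues[slave].append(key)
        self.pending += 1
        if self.pending >= self.flush_threshold:
            self.flush()
        return True

    def flush(self):
        """Send every non-empty queue to its slave and reset the counters."""
        for slave, q in enumerate(self.queues):
            if q:
                self.send(slave, q)
                self.queues[slave] = []
        self.pending = 0


sent = []
master = Master([1, 7, 15, 19], flush_threshold=3,
                send=lambda slave, keys: sent.append((slave, list(keys))))
```

Flushing on the total count rather than per queue bounds the delay of any queued key by the arrival of a fixed number of further keys, at the cost of sometimes sending small messages to lightly loaded slaves.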
The two processes on the slave side, the computation process and the communication process, act as the consumer and the producer, respectively. When a message is available, the computation process will fetch it, do lookups for each key in the message one by one, and put the results in place of the key in the message. After a message is processed, the results stored in the message are sent back to the master.
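The in-place processing of a message on the slave can be sketched as follows, assuming the result of a lookup is the position of the key within the slave's partition, with -1 for an absent key. The result encoding and the name `process_message` are assumptions; the text above only states that results replace the keys in the message.

```python
from bisect import bisect_left


def process_message(partition, message):
    """Overwrite each key in `message` with its position in `partition`, or -1 if absent."""
    for i, key in enumerate(message):
        pos = bisect_left(partition, key)
        found = pos < len(partition) and partition[pos] == key
        message[i] = pos if found else -1
    return message  # the same buffer is sent back to the master


partition = [9, 11, 13, 15]
msg = [9, 10, 15]
out = process_message(partition, msg)
```

Reusing the message buffer for the results avoids allocating and filling a second buffer on the slave.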
In our implementation, search keys are dispatched according to their high-order bit values instead of by comparison with the n+1 partition keys. The reason for this choice will be explained in Section 4.3.
Network Performance: Myrinet (High-Speed Switched Network)
Conventional Ethernet can be used on clusters, but it does not provide the performance required for high-performance or high-availability clustering.
With the development of technology, gigabit networks are available for clusters. Myricom's Myrinet is one such solution. Myrinet is a new packet-communication and switching technology that is widely used to interconnect clusters of workstations, PCs, servers, or single-board computers.
Myrinet provides both high performance and low latency. Myrinet can run at a full-duplex speed of 2.2 Gigabit/second. Myrinet uses cut-through switches instead of store-and-forward ones. All Myrinet packets carry a source-based routing header to provide intermediate switches with forwarding directions. The firmware of Myrinet interacts directly with host processes ("OS bypass") for low-latency communication, and directly with the network to send, receive, and buffer packets. Myrinet packets can be of arbitrary length, so Myrinet can carry packets of many types or protocols concurrently. Myrinet can encapsulate other types of packets without an adaptation layer.
Other software interfaces such as MPI, Sockets, and TCP/IP are layered efficiently over GM.
Performance Issues On The Master Side
There are two important issues that affect the performance of the master: key comparison and message passing. The master dispatches each search key to an appropriate slave according to the search key value and the partition key values. The standard method does binary search among the (n+1) partition keys to decide the appropriate slave. In our experiments, we observe that the cost of binary search is very high even for a small number of keys, because of the high branch misprediction penalty of the CPU. With binary search, the master is a bottleneck, and the overall performance doesn't increase as the number of slaves increases.
To solve the master bottleneck problem, two approaches can be applied: add more masters to distribute the workload, or use a non-comparison-based search method such as hashing or the modulo operation, which is what we apply in our experiments.
To improve the performance on the master side, we assume that the partition key values are uniformly distributed. This is a reasonable assumption, because with a small number of partitions the partition keys are likely to be uniformly distributed. If the (n+1) partition keys are uniformly distributed, the partition number can be calculated with a modulo operation. Using the modulo operation instead of binary search yields a great performance improvement. If the (n+1) partition keys are not uniformly distributed, one can still combine the modulo operation with comparisons to reduce the number of comparisons needed to calculate the partition number for each search key.
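A comparison-free dispatch under the uniformity assumption can be sketched in two ways. Note that the sketch uses integer division and a bit shift as stand-ins for the arithmetic computation described above; the exact operation used in the experiments is not specified here, and `dispatch_uniform` and `dispatch_high_bits` are hypothetical names.

```python
def dispatch_uniform(key, lo, hi, num_slaves):
    """Map key in [lo, hi] to a slave index with one division, assuming evenly spaced partition keys."""
    if key < lo or key > hi:
        return None
    width = (hi - lo + num_slaves) // num_slaves  # ceil((hi - lo + 1) / num_slaves)
    return (key - lo) // width


def dispatch_high_bits(key, key_bits, slave_bits):
    """Use the top slave_bits bits of a key_bits-wide unsigned key as the slave index.

    Assumes the number of slaves is 2 ** slave_bits and keys span the full key_bits range.
    """
    return key >> (key_bits - slave_bits)
```

Neither function contains a data-dependent branch on the partition keys, so the branch mispredictions that make binary search expensive on the master do not arise.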
Another observation is that with a reasonable packet size the communication cost can nearly totally overlap with computation cost in the Linux cluster configured with Myrinet.
Traditionally, message passing involves three major steps:
1. Copy data from the buffer defined by user to the system buffer defined by OS.
2. Transfer data over network.
3. Copy data from the system buffer defined by OS to the buffer defined by user.
Steps 1 and 3 are avoided in Myrinet, because Myrinet has an "operating system bypass" scheme.
Performance Issues On The Slave Side
There are two major costs on the slave side: key lookups and sending results back to the master. On the slave side, because the working data fits into the L2 cache, there are no L2 cache misses or TLB misses.
On the slave side, three alternative designs are proposed:
1. Search over a CSB+ tree index constructed over one partition of the sorted array. The CSB+ tree has been shown to perform better than the B+ tree index.
2. Search over a CSB+ tree index constructed over one partition of the sorted array, with the buffering technique. We apply the buffering technique between the L2 cache and the L1 cache. Whether to buffer is decided according to the size of the current message. If the current message is large, buffering accesses can reduce the number of L1 misses. If the current message is small, the overhead of maintaining the buffer may degrade the overall performance. To aid the buffering accesses, the node size in the CSB+ tree is defined as the size of the L1 cache.
3. Binary search over one partition of the sorted array. Tree index structures were originally proposed to avoid the random disk accesses caused by binary search over a disk-resident sorted array. Later work showed that tree index structures are also superior to binary search over memory-resident sorted arrays. In our case, the sorted array fits into the L2 cache. The intuition is that tree index structures may be superior to binary search for an L2-cache-resident sorted array if one aligns the node size to the L1 cache line size. However, the L1 cache is very small. Only the upper 2 levels of the tree index structure can fit in the L1 cache. For binary search, several frequently accessed elements are in the L1 cache too. Hence, with an index structure, the L1 cache performance can't be greatly improved. There are also extra comparison costs and extra space needed for the tree index structures. Therefore, the overall performance may be even worse with an index structure. The experiments in Section 6.2 show that binary search over the L2-cache-resident sorted array performs best.
Evaluation of Cooperative CPU Caching on Index Structures
In this section, we first introduce a model to analyze the cache performance of the B+ tree index structure. The model is based on the expected number of cache line misses for each key lookup. TLB misses are not considered in our model, so our model gives a lower bound on the running time. We then apply this model to analyze three different designs: a standard design, a buffering access design, and a cooperative CPU caching design. We refer to these methods as A, B, and C, respectively, in the following sections. In our evaluation model, a memory-resident, fully filled CSB+ tree index structure and a stream of arriving search keys are assumed. We also assume that the node size in the CSB+ tree is the same as the L2 cache line size. Our model is not limited to CSB+ trees or B+ trees; it can be generalized to any traversal problem for which the access rates to the nodes are exponentially distributed. Table 1 enumerates all the notations that will be used in our later discussion:
The Model of Cache Performance for Tree Traversal
Ladner et al. [14] proposed a theoretical model to predict the number of cache misses for tree traversal problems. In [14], uniform access to all nodes in the tree is assumed. If each memory block of the working data set has an equal access rate, the model in [14] can be applied. The theorem in [14] is not suitable for a large number of B+ tree lookups. For index structure lookups, the access rates from the leaf nodes to the root node are exponentially distributed. For a large number of B+ tree lookups, the nodes at the upper levels of the B+ tree are frequently accessed, but the nodes at the lower levels have a much lower probability of being accessed. Hence, the distribution of the access rates over the nodes in the B+ tree is exponential rather than uniform.
Hankins and Patel [11] addressed the exponential distribution of the access rates to the nodes in the B+ tree. Their model considers compulsory cache misses but not capacity cache misses, and assumes a large L2 cache that can contain the whole B+ tree. It models the exponential distribution of node accesses for a large number of B+ tree lookups. For a smaller B+ tree that fits in the L2 cache, we can apply this model directly. However, for a memory-resident B+ tree that can't fit in the L2 cache, this model no longer works.
According to [11] , for a B+ tree that can fit in the L2 cache, the expected number of cache misses for each key lookup is:
and:
In the above formula, λ_i is the number of cache lines at the ith level of the tree, and q is the total number of keys to be looked up.
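To give a feel for how such a model behaves, the sketch below uses the classical occupancy estimate: if each lookup touches one uniformly random cache line among the λ_i lines of level i, then q lookups are expected to touch λ_i (1 − (1 − 1/λ_i)^q) distinct lines at that level. This is an assumption chosen for illustration and may differ in detail from the formulas of [11]; the function names are hypothetical.

```python
def expected_lines_touched(lines_per_level, q):
    """Expected distinct cache lines touched by q lookups.

    Assumes each lookup hits one uniformly random line per level
    (the classical occupancy estimate).
    """
    return sum(lam * (1.0 - (1.0 - 1.0 / lam) ** q) for lam in lines_per_level)


def expected_misses_per_lookup(lines_per_level, q):
    """Compulsory misses per lookup when every touched line stays in the cache."""
    return expected_lines_touched(lines_per_level, q) / q


levels = [1, 8, 64]  # hypothetical tree: lambda_i for the root level down to the leaves
```

With a single lookup every level contributes exactly one line, and as q grows the total approaches the number of lines in the tree, so the per-lookup miss count tends to zero when the whole tree fits in the cache.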
We use [11] as the base and further develop the model for a memory-resident B+ tree. In most cases, the B+ tree is memory resident and can't fit in the L2 cache. Capacity cache misses are an important factor for a large number of key lookups.
We analyze the problem in two steps:
1. We assume that the B+ tree space touched by the first q_0 lookups is exactly the size of the L2 cache. The resulting cache state is denoted S0.
2. The number of cache misses for each key lookup after the q_0-th lookup is:
For the (q_0 + 1)th lookup, the space that needs to be loaded from memory to cache is calculated in Equation 4. After the (q_0 + 1)th lookup, the cache state is the same as state S0. Hence, each of the following lookups also needs to load the space calculated in Equation 4 from memory to cache.
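The threshold q_0 of step 1 can be located numerically once a footprint estimate is chosen. The sketch below assumes the classical occupancy estimate (each lookup touches one uniformly random cache line per level), which is an illustrative assumption and not necessarily the paper's own formula; `lines_touched` and `find_q0` are hypothetical names.

```python
def lines_touched(lines_per_level, q):
    """Expected distinct cache lines touched by q lookups (occupancy estimate)."""
    return sum(lam * (1.0 - (1.0 - 1.0 / lam) ** q) for lam in lines_per_level)


def find_q0(lines_per_level, cache_lines):
    """Smallest q whose expected footprint reaches cache_lines, or None if the tree fits."""
    if sum(lines_per_level) <= cache_lines:
        return None  # no capacity misses: the whole tree fits in the cache
    lo, hi = 1, 2
    while lines_touched(lines_per_level, hi) < cache_lines:
        hi *= 2      # grow the upper bound until the footprint fills the cache
    while lo < hi:   # binary search on the monotone footprint function
        mid = (lo + hi) // 2
        if lines_touched(lines_per_level, mid) >= cache_lines:
            hi = mid
        else:
            lo = mid + 1
    return lo


levels = [1, 8, 64, 512]  # hypothetical tree with 585 cache lines in total
q0 = find_q0(levels, cache_lines=100)
```

Before q_0 lookups only compulsory misses occur; beyond that point the cache is full and every newly loaded line displaces another, which is where capacity misses enter the model.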
Index Structure Analysis for Search Operands without Cooperative CPU Caching
In our evaluation, we examine only the data cache behavior, while ignoring the instruction cache and TLB misses. In the three methods, the instruction complexity is comparable.
Methods A and B can be greatly affected by TLB misses, but method C is not. In methods A and B, the working data set is very large, so it spans many pages. The TLB can hold only a few entries; on the Pentium 4, there are 64 entries in the TLB. For applications working on a large dataset, TLB misses happen frequently. In method C, the working data set is small and stored contiguously in memory. Hence, in method C, there will be few TLB misses except immediately after a cold start. The analysis therefore yields a lower bound on the running time for methods A and B, and a more accurate running time for method C.
Method A: Standard Method
For each key lookup the cost for the standard one-by-one key lookup is:
The first term is the computation cost and the other terms are the memory access costs. Any path from the root node to one of the leaf nodes consists of T nodes for a tree with T levels. The measured computation cost spent on one node is Comparison_Cost_Node. Therefore, the computation cost is T × Comparison_Cost_Node.
The memory access cost consists of two parts: buffer access cost and tree access cost. There are two buffers: the input buffer to store search keys and the output buffer to store search results. The input buffer and output buffer access costs are 4/W_1 each, because the input buffer and output buffer are accessed sequentially, and sequential access to main memory proceeds at the full bandwidth of RAM. The tree access cost is calculated according to Equation 4 with q ≫ q_0. We ignore the time spent accessing data in the L2 cache, because access to data in memory dominates the time. Intuitively, the frequently accessed upper two levels of the tree have a higher probability of remaining in cache, but the lower levels of the tree are usually not in the cache. Hence, a cache miss typically happens at each level while accessing the lower part of the tree. Furthermore, prefetching can't help: to decide the next node to be accessed, the comparison result for the current node has to be known.
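The structure of the method A cost can be written down as a small function. This is a sketch of the terms described above rather than the paper's equation itself: the tree access cost is taken as an input because it depends on the cache model, the reading of 4/W_1 as 4 bytes moved at bandwidth W_1 is an assumption, and `method_a_cost` is a hypothetical name. All arguments must use consistent units.

```python
def method_a_cost(T, comparison_cost_node, W1, tree_access_cost):
    """Per-lookup cost of method A: computation + buffer access + tree access."""
    computation = T * comparison_cost_node  # one node visited per tree level
    buffer_access = 4.0 / W1 + 4.0 / W1     # sequential input read and output write
    return computation + buffer_access + tree_access_cost


# Hypothetical numbers chosen only to exercise the formula.
cost = method_a_cost(T=4, comparison_cost_node=2.0, W1=4.0, tree_access_cost=10.0)
```

For a large tree the `tree_access_cost` term dominates, since the buffer terms benefit from sequential access while each tree level below the cached top levels costs a random memory access.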
Method B: The Buffering Access Method
Zhou [19] proposed the buffering accesses method for an arriving stream of search keys, shown in Figure 2. The index tree is logically decomposed into several subtrees, each of which fits into the L2 cache. A buffer is maintained for each subtree to store the search keys that have been passed down from higher subtrees. The stream of arriving search keys is partitioned into batches. For each batch lookup, all search keys in the same batch are examined at the root subtree and dispatched to the appropriate buffers of the child subtrees with the usual lookup protocol. This process continues until the leaf nodes are reached, whereupon the search results are produced.
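The batch flow of the buffering accesses method can be sketched for the simplest case of two subtree levels. The sketch assumes the root subtree is represented by a sorted list of separators and each child subtree by a sorted list of keys; this flat representation and the name `buffered_batch_lookup` are illustrative assumptions, not the data layout of [19].

```python
from bisect import bisect_left, bisect_right


def buffered_batch_lookup(separators, subtrees, batch):
    """Two-level sketch: route the whole batch through the root, then drain one child buffer at a time.

    Requires len(subtrees) == len(separators) + 1.
    """
    buffers = [[] for _ in subtrees]
    for pos, key in enumerate(batch):            # pass 1: only the root subtree is touched
        buffers[bisect_right(separators, key)].append((pos, key))
    results = [False] * len(batch)
    for subtree, buf in zip(subtrees, buffers):  # pass 2: each child subtree is visited once
        for pos, key in buf:
            i = bisect_left(subtree, key)
            results[pos] = i < len(subtree) and subtree[i] == key
    return results


separators = [10, 20]
subtrees = [[1, 5], [10, 15], [20, 30]]
found = buffered_batch_lookup(separators, subtrees, [15, 1, 30, 12])
```

Each child subtree is loaded into the cache once per batch and then serves every key in its buffer, which is how the loading cost is shared among the keys of a batch.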
The buffering access method improves the batch lookup performance by exploiting temporal and spatial locality. Compared with method A, the cost of loading a standard node of the index tree is amortized over all search keys stored in the buffer of the subtree that contains the node. For each search key, the cost is:
The first term represents the computation cost explained in Section 5.2.1. The other terms represent memory access costs. The tree access cost is △. The memory access cost to read a key from a buffer is 4/W_1, because each buffer is accessed sequentially. Because there are T/L subtrees in total, each search key needs to be read from buffers T/L times. Therefore, the total memory access cost to read a search key from buffers is (4/W_1) × (T/L).
The memory access cost to write a search key to an appropriate buffer is
, because each time a write buffer is selected according to a random key value. With T/L subtrees, each search key is written to a buffer T/L − 1 times, because at the bottom subtree the results are produced and pipelined to the next operator. Hence, the total memory access cost to write a search key to buffers is
× (T/L − 1). There are two parts to the tree access cost: the time spent loading the subtrees from memory into the L2 cache one by one (θ_1), and the time spent accessing a subtree in the L2 cache after it has been loaded (θ_2). The time spent loading all the subtrees from memory into the L2 cache can be calculated with Equation 1, because each subtree fits in the L2 cache. For each key lookup, the average number of L2 cache misses is:
For each lookup, the number of nodes to be accessed is T, and the number of L2 cache lines to be accessed is also T, because the node size is the same as the L2 cache line size. The number of cache lines to be accessed in the L2 cache will be T −
. Therefore, the time spent accessing data in the L1 cache will be:
Impact of Prefetching and Double Buffering
With prefetching and double buffering in the L2 cache, the time to load the next subtree from memory to the L2 cache may overlap with the computation time spent on the current subtree. When a subtree is being visited and all search keys in its buffer are being examined, the next subtree can be concurrently loaded into the L2 cache with prefetching techniques. In this case, the size of a subtree should be smaller than half the size of the L2 cache. With smaller subtrees, both the number of subtrees and the number of buffers to be maintained will increase. Also, applying software or hardware prefetching will produce more load instructions. With more load instructions, the memory bus will spend more time making decisions. With prefetching and double buffering, θ_1 = max{ (Size_Tree / B_1) / NUM_keys_per_batch, T × B_1_Miss_Penalty }. Without prefetching and double buffering, parts of the subtree are loaded from memory to cache in random order. With prefetching and double buffering, the whole subtree is loaded from memory to the L2 cache sequentially. Prefetching and double buffering make full use of the memory bandwidth, but they also require more memory accesses.
For smaller batches, prefetching and double buffering are not helpful. The time to load a subtree from memory to the L2 cache is amortized over a smaller number of search keys. For each small batch, the lookup of the subtree in the L2 cache costs less than loading the subtree from memory to the L2 cache at full memory bandwidth. Hence, the time to load subtrees from memory to the L2 cache dominates. The improvement from prefetching and double buffering is proportional to the utilization of the subtree, and the utilization of the subtree is low for smaller batches. Hence, the improvement is marginal for smaller batches.
For larger batches, prefetching and double buffering yield only a tiny improvement in overall performance. Without prefetching and double buffering, the value of θ_1 can be ignored. However, the other overheads explained in the previous paragraphs will be added in.
Method C: Cooperative CPU Caching for Index Structures
The following are the assumptions for method C. We make these assumptions to simplify the analysis. In practice these assumptions are not required.
1. Aggregate network bandwidth is unlimited.
2. There are enough nodes in the cluster so that the aggregate L2 caches over the cluster can hold the entire index structure. Each node does computation and data access in cache.
3. T < 2L, so that each search can be done within the caches of just two nodes: a master and a slave. Here, we make this assumption to make the model simpler. In practice, if T > 2L, each search needs to traverse more than the caches of two nodes and our design still can be applied.
4. At a given instant, there are no clashes in which two masters try to talk to the same slave.
5. Masters and slaves do their tasks in parallel.
6. The computation on the master doesn't become a bottleneck. The workload between masters and slaves is fair. The ratio of the number of slaves to the number of masters is same as the ratio of the total workload at slaves to the total workload at masters.
For each search key, the average cost is
In Equation 9, the first part is the cost on the master side and the second part is the cost on the slave side. The maximum value is the real cost because masters and slaves do tasks in parallel. The following explains how to calculate the costs on the master side and the slave side.
Cost on the master side for each search key:
1. Computation time: Dispatch Cost Per Search Key. This cost depends on the distribution of search key values, as explained in Section 4.1. If this cost is too high, more masters are needed, according to the ratio given in assumption 6.
2. Memory access time: 8/W1. This is the cost of reading a key from the search key array and putting it into a buffer for an outgoing message. Because accesses to the search key array and to the buffer are both sequential, the full memory bandwidth can be used to transfer data.
3. Communication time: 4/W2. For each search key, the network transmission time is considered but not the latency, because keys are sent out in messages whose size is on the order of kilobytes.
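The three master-side components can be combined as in the sketch below. It is an illustration under stated assumptions: the components are assumed to simply add up, W1 and W2 are assumed to be expressed in bytes per nanosecond, and all function names and sample values are hypothetical. The slave-side cost is passed in as a plain number.

```python
def master_cost_per_key(dispatch_ns: float,
                        w1_bytes_per_ns: float,
                        w2_bytes_per_ns: float) -> float:
    """Master-side cost per search key, in nanoseconds.

    dispatch_ns      -- Dispatch Cost Per Search Key (computation time)
    w1_bytes_per_ns  -- memory bandwidth W1 (unit is an assumption)
    w2_bytes_per_ns  -- network bandwidth W2 (unit is an assumption)
    """
    memory_ns = 8.0 / w1_bytes_per_ns    # 8/W1: read the key, write it to the buffer
    network_ns = 4.0 / w2_bytes_per_ns   # 4/W2: transmission time only, no latency
    return dispatch_ns + memory_ns + network_ns


def cost_per_key(master_ns: float, slave_ns: float) -> float:
    """Masters and slaves work in parallel, so the larger side dominates."""
    return max(master_ns, slave_ns)


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
master_ns = master_cost_per_key(dispatch_ns=10.0,
                                w1_bytes_per_ns=0.5,
                                w2_bytes_per_ns=0.125)
total_ns = cost_per_key(master_ns, slave_ns=200.0)
```

With these sample inputs the master side costs 10 + 16 + 32 = 58 ns per key, so a slave-side cost of 200 ns determines the overall cost.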
The cost on the slave side for each search key:
Experimental Validation
We ran all experiments on a Pentium III Linux cluster. The parameters of the tree structure used in all experiments are reported in Table 2, except where otherwise noted. Both the search keys and the keys used to construct the index structure are randomly generated.
Environment and Model Verification
First, we wrote programs to measure the environment parameters of the Linux cluster: the memory bandwidth, the network bandwidth, the L2 cache line miss penalty, the L1 cache line miss penalty, and the comparison cost at a node whose size equals the L2 cache line size. These numbers are reported in Table 3. Using the measured parameters and the equations in Section 5, we predicted the average cost of a query for the three methods. These predictions are reported in Table 4. We also ran experiments to check the accuracy of the model. In Table 4, a batch size of 128 KB is used, and one master and ten slaves are used in method C. Table 4 shows that our model is over 90% accurate.
Evaluating the Three Alternative Strategies on the Slave Side in Method C
With our cooperative CPU caching design, there are three alternative strategies, introduced in Section 4.4, for querying an L2-cache-resident sorted array: querying a CSB+ tree with the traditional protocol (C-1); querying a CSB+ tree with buffered accesses (C-2); and binary search over the sorted array (C-3).
We ran experiments to compare the three alternative strategies on the slave side. We ran 1 million queries on a single node of the Linux cluster, using a 256 KB sorted array. Table 5 shows that method C-3 has the best performance and method C-1 the worst. C-1 and C-2 improve L1 cache performance by setting the node size equal to the L1 cache line size and the L1 cache size, respectively. However, querying an index in C-1 and C-2 requires more memory space, as shown in Table 5, and executes more instructions than binary search over a sorted array in C-3.
| Strategy | Memory space used | Experimental average time |
|---|---|---|
| Method C-1 | 329 KB | 319 ns |
| Method C-2 | 320 KB | 267 ns |
| Method C-3 | 256 KB | 238 ns |
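For reference, strategy C-3 is a plain binary search over the sorted array. The sketch below shows the idea in Python; it is not the implementation measured above, it does not model cache behavior, and the return convention (the index of the key, or -1 when the key is absent) is an assumption made for the example.

```python
def binary_search(sorted_keys, key):
    """Return the index of key in sorted_keys, or -1 if it is absent.

    sorted_keys must be sorted in ascending order. No extra index structure
    is needed, which is why C-3 uses no memory beyond the array itself.
    """
    lo, hi = 0, len(sorted_keys)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_keys[mid] < key:
            lo = mid + 1      # key can only be to the right of mid
        else:
            hi = mid          # key is at mid or to its left
    if lo < len(sorted_keys) and sorted_keys[lo] == key:
        return lo
    return -1


# Small illustrative array; the experiments use a 256 KB array.
keys = [3, 8, 15, 23, 42]
found = binary_search(keys, 23)
missing = binary_search(keys, 4)
```

Each lookup performs about log2(n) comparisons and touches only the array itself.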
Comparing Three Different Methods: A, B and C
In all experiments, the index tree described in Table 2 is used, and 8 million search keys are randomly generated. For methods A and B, the 8 million search keys are looked up on one node. For method C, 11 nodes are used: one acts as the master, and the others act as slaves. To make a fair comparison, normalization is applied: the running times measured for methods A and B are divided by 11. In Figure 3, the x-axis shows the batch size and the y-axis shows the running time for 8 million search keys. We ran experiments for batch sizes ranging from 8 KB to 4 MB.
All the experiments show that method C-3 has the best performance according to two measurements: throughput and response time. According to Figure 3, method C-3 with smaller batch sizes outperforms method B even with larger batch sizes. For instance, method C-3 with a 64 KB batch size performs better than method B with a 256 KB batch size. Smaller batch sizes give faster query response times, and larger batch sizes give slower ones. Figure 3 shows that method A has stable performance, because it does not depend on the batch size. Method B outperforms method A for batch sizes larger than 64 KB, but performs worse than method A for batch sizes smaller than 32 KB. For smaller batch sizes, the overhead of maintaining a buffer and the cost of loading a subtree from memory into cache are amortized over a small number of keys, so the amortized cost is high. For larger batch sizes, the same overhead and loading cost are amortized over a larger number of keys, so the amortized cost is low. Hence, as the batch size increases, method B performs better. Once the batch size exceeds 2 MB, method B's performance stabilizes and cannot be improved much by increasing the batch size further. As explained in Section 5.2.2, the only factor affected by the batch size is θ1. As the number of keys in a batch increases, the value of θ1 decreases. When the batch size reaches a fairly large value, θ1 ≈ 0. Increasing the batch size further leaves θ1 at approximately 0, so the total performance does not change much.
For batch sizes larger than 32 KB, method C-3 is the best, with both higher throughput and faster response time. Its performance also improves as the batch size increases, and it stabilizes once the batch size reaches 512 KB. With increasing batch sizes, communication cost is more likely to overlap with computation cost. Hence, for batch sizes over 512 KB, computation cost accounts for the total cost, which no longer changes with the batch size.
Methods C-1 and C-2 follow the same trend as method C-3 as the batch size increases, but with slightly worse performance. As demonstrated in Section 6.2, methods C-1 and C-2 have a higher computation cost than method C-3.
If the batch size is smaller than 32 KB, methods C-1, C-2, and C-3 are worse than method B and method A, because the communication latency overhead is a large penalty.
Superlinear Speedup Achievement: According to Figure 3, a speedup of 16.8 is achieved with method C-3 running on 11 nodes over method A running on one node, for batch sizes larger than 1 MB. A speedup of 13.3 is achieved with method C-3 running on 11 nodes over method B running on one node, for a batch size of 256 KB. From Figure 3, comparing all strategies in method C with method A, superlinear speedup is achieved for batch sizes larger than 32 KB. Comparing the strategies in method C with method B, superlinear speedup is achieved for batch sizes ranging from 32 KB to 512 KB. Comparing method C-3 with method B, superlinear speedup is achieved for batch sizes larger than 16 KB.
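The superlinearity criterion used here can be stated in a few lines. The sketch below is illustrative only; the function names are hypothetical, and the only figures taken from our results are the two reported speedups and the node count of 11.

```python
def speedup(single_node_time: float, cluster_time: float) -> float:
    """Speedup of the cluster run over the single-node run (same workload)."""
    return single_node_time / cluster_time


def is_superlinear(speedup_value: float, nodes: int) -> bool:
    """A speedup is superlinear when it exceeds the number of nodes used.

    Equivalently, the cluster time is below the normalized single-node
    time, that is, the single-node time divided by the number of nodes.
    """
    return speedup_value > nodes


NODES = 11  # one master plus ten slaves

c3_over_a = is_superlinear(16.8, NODES)  # C-3 vs. A, batches larger than 1 MB
c3_over_b = is_superlinear(13.3, NODES)  # C-3 vs. B, 256 KB batches
```

Both reported speedups exceed 11, so both comparisons are superlinear under this definition.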
Conclusion
Both the conventional query method and the buffering access query method over a CSB+ tree produce many cache and TLB misses. Cooperative CPU caching is proposed to solve this problem, and it is implemented and evaluated on a Linux cluster. Three alternative query strategies over an in-cache sorted array are proposed and evaluated. By combining cooperative CPU caching with the best strategy over an in-cache sorted array, our new method achieves superlinear speedup compared with both the traditional index query method and the buffering access query method. Our new method provides both higher throughput and faster response time. With 11 nodes, it achieves a speedup of 16.8 over the traditional index query method and a speedup of 13.3 over the buffering access method. Our new model for analyzing query performance is also evaluated experimentally and yields over 90% accuracy.
According to the model, our new query method is affected by CPU speed and network speed. The traditional method is affected by CPU speed and random memory access latency. The buffering method is affected by CPU speed and memory speed. According to technology trends, CPU speed will double every 18 months. Memory speed will increase at a much lower rate of 10% per year. Network speed is catching up with memory speed and may grow at the same rate as memory speed. Random memory access latency will remain stable because of inherent hardware difficulties and increasing cache line sizes. In Figure 8, we use our evaluation model to predict the future running time of method C-3, as well as its speedup over method A and method B. In five years, we predict that method C will be 80 times faster than method A and 40 times faster than method B with 11 nodes. In Figure 8, the speedup of method C-3 over method A increases exponentially, and the speedup of method C-3 over method B increases linearly. Hence, the speedups of method C over method A and method B will follow the technology trends.
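The compounding behind these trends is easy to make concrete. The sketch below only computes the two growth factors quoted above over a five-year horizon; it does not reproduce Figure 8 or the predicted speedups, and the function names are hypothetical.

```python
def doubling_growth(years: float, doubling_months: float = 18.0) -> float:
    """Growth factor for a quantity that doubles every `doubling_months`."""
    return 2.0 ** (years * 12.0 / doubling_months)


def annual_growth(years: float, annual_rate: float = 0.10) -> float:
    """Growth factor for a quantity that grows by `annual_rate` per year."""
    return (1.0 + annual_rate) ** years


cpu_factor = doubling_growth(5)     # CPU speed, doubling every 18 months
memory_factor = annual_growth(5)    # memory speed, 10% per year
gap = cpu_factor / memory_factor    # how much faster the CPU grows than memory
```

Over five years the CPU factor is roughly 10 while the memory factor is roughly 1.6, which is the widening gap the prediction builds on.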
Future Work
For experiments with batch sizes over 4 MB, not shown here, we observe that method C performs worse than with smaller batch sizes. We believe that this loss of performance gain is an artifact of the network protocols and hardware that we are using, although we have not yet been able to assign a definite cause. We expect to do this in the future. In practice, this issue does not affect the importance of our results, since we can always choose to send more messages of smaller size in order to achieve the best performance of method C.
Acknowledgment
We thank the Mariner Project at Boston University for providing the experimental facilities. 
