Hashing is a fundamental technique in computer science that allows O(1) inserts and lookups of items in an associative array. Here we present several thread coordination and hashing strategies and compare their performance on large, shared memory symmetric multiprocessor machines, each possessing between half a terabyte and a full terabyte of memory. We show how our approach can be used as a key kernel for fundamental paradigms such as dynamic programming and MapReduce. We further show that a set of approaches yields close to linear speedup for both uniform random and more difficult power law distributions. This scalable performance is achieved even though our set of approaches is not completely lock-free. Our experimental results compare an SGI Altix UV with 4 Xeon processors (32 cores) and a Cray XMT with 128 processors. At the scale of data we addressed, on the order of 5 billion integers, we show that the Altix UV far exceeds the performance of the Cray XMT for power law distributions. However, the Cray XMT exhibits greater scalability.
INTRODUCTION
In this paper, we present a set of novel thread coordination and hashing strategies. We evaluate their performance empirically on two large, shared memory systems: an SGI Altix UV with 4 Xeon processors (32 cores) and a half terabyte of main memory, and a Cray XMT with 128 Threadstorm processors and one terabyte of memory. The particular use case we investigate is that of batch insertion of a large set of key-value pairs, S = {(k1, v1), (k2, v2), ..., (kn, vn)}, into a hash table, where each insertion results in an update to the currently stored value. For example, when inserting the pair, (ki, vi), the process is as follows:
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SC11, November 12-18, 2011, Seattle, Washington, USA Copyright 2011 ACM 978-1-4503-0771-0/11/11 ...$10.00.
1. Calculate the hash of the key according to the selected hash function: h_ki ← hash(ki). For simplicity, we will ignore, for the moment, the possibility of hash collisions.
2. At location h_ki, there is an associated value, which is updated with vi by applying an update function. We call this operation of updating a key's stored value with additional information an updateInsert. It takes three parameters: a key, a value, and an associative and commutative update function. Note that, while we focus on the updateInsert operation for this paper, the framework we present easily extends to a vanilla insert operation (i.e. if a value already exists for a given key, the insert results in a no-op). Also, lookups proceed similarly to the above procedure except that, instead of updating the value, we simply return it. We do not handle the case of deletions. We have investigated extensions that allow for deletions, but they add overhead and are not necessary for the use cases we address in this paper.
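The three operations described above can be sketched as follows. This is a minimal, single-threaded illustration that uses a Python dict as a stand-in for the hash table; the function names update_insert, insert, and lookup are chosen here for illustration and are not taken from the implementation evaluated in this paper.

```python
from operator import add

def update_insert(table, key, value, update):
    """Fold value into the entry for key, creating the entry if it is absent."""
    table[key] = update(table[key], value) if key in table else value

def insert(table, key, value):
    """Vanilla insert: a no-op when the key already has a value."""
    table.setdefault(key, value)

def lookup(table, key):
    """Return the stored value, or None when the key is absent."""
    return table.get(key)

counts = {}
for k in ["a", "b", "a", "a"]:
    update_insert(counts, k, 1, add)   # counting: the update function is addition
insert(counts, "a", 99)                # ignored, "a" is already present
insert(counts, "c", 5)
```

Because the update function is associative and commutative, the final table does not depend on the order in which the pairs are applied, which is the property the parallel strategies rely on.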
The tasks we target are in support of two fundamental programming paradigms: dynamic programming [1] and MapReduce [4] .
The basic idea behind dynamic programming is to compute the optimal solution to a complex problem by 1) breaking it down into subproblems, 2) finding the optimal solution to each subproblem, and 3) combining them. Often, many of these subproblems are identical; one of the speedups obtained by dynamic programming comes from avoiding duplicated computation. One way to do this is with a hash table that supports inserts and lookups [16]. The idea is to take a "top-down" approach with recursion, starting from the original problem and diving down into subproblems. The solutions of subproblems are stored in a hash table. Thus, a thread can first check the hash table to see if a subproblem has already been solved. If so, it can simply use the answer stored in the hash table. Otherwise, it will perform the work and insert the solution in the hash table for consumption by others.
To see how batch inserts (and lookups) can be applied in a dynamic programming setting, first consider the parallelization scheme by Stivala et al. [16] presented below in Figure 2. The pseudo-code follows this recursive scheme:
f(x) = if b(x) then g(x)
       else F(f(x1), ..., f(xn))
The function b(x) returns true if a base case is satisfied, g(x) is the result of the base case, and F is a function that combines the optimal answers to the subproblems. The way that Stivala et al. approach parallelization is to have each thread solve f(x) as if it is independent of all the other threads. Each thread explores the space in a depth-first manner, though each branch is selected randomly. The random order of evaluation of branches is what allows the problem to be solved in parallel; otherwise every single thread would be duplicating the exact same work. Coordination among the threads is through parallel lookups and inserts to a shared hash table, and considerable savings can be obtained when the various paths explored by the threads share common subproblems.
Figure 2: A general approach to parallelizing dynamic programming presented by Stivala et al. [16] that we modify to use batch updates and inserts.
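The recursive scheme f(x) can be illustrated with a small single-threaded sketch in which a dict plays the role of the shared hash table. The helper names (solve, is_base, base_value, subproblems, combine) are hypothetical, and the randomized branch order and threading of Stivala et al. are omitted here; the sketch only shows the lookup-then-insert pattern.

```python
def solve(x, table, is_base, base_value, subproblems, combine):
    """Top-down evaluation of f(x), consulting the table before doing any work."""
    if x in table:                      # lookup: subproblem already solved
        return table[x]
    if is_base(x):                      # b(x)
        v = base_value(x)               # g(x)
    else:                               # F(f(x1), ..., f(xn))
        v = combine([solve(s, table, is_base, base_value, subproblems, combine)
                     for s in subproblems(x)])
    table[x] = v                        # insert: publish the solution for reuse
    return v

# Example instance: Fibonacci numbers, where f(x) = f(x-1) + f(x-2).
table = {}
fib10 = solve(10, table,
              is_base=lambda x: x < 2,
              base_value=lambda x: x,
              subproblems=lambda x: [x - 1, x - 2],
              combine=sum)
```

Each distinct subproblem is computed once and stored, so the table ends up holding one entry per value of x from 0 to 10.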
To modify this approach for batch lookups and inserts, one could divide the work into three phases: search, lookup, and insert. In the search phase, each thread follows the execution outlined in Figure 2 until it reaches either a lookup or an insert. These values are added to one of two global lists, one for lookups, and one for inserts. The lookup and insert phases then process these lists in batch. To utilize the mechanisms we present in this paper, we expect that the number of threads running during the search phase must be much greater than the number of threads active during the lookup and insert phases.
Besides dynamic programming, the applicability of our hash table implementation widens further when we consider how it can be used as the basis for a MapReduce implementation on a shared memory architecture. MapReduce is a programming paradigm popularized by Google and described by Dean and Ghemawat [4]. We will first describe MapReduce against the backdrop of the originally targeted architecture, that of distributed memory clusters. We will then discuss how these concepts map to a shared memory design. Figure 1 shows the overall process. Raw data is partitioned and given to a set of processes each executing the same function, named the map function. Each map process scans through its partition of data and emits key-value pairs (k, v). A common example application for MapReduce is to count word frequencies within a document corpus. The map function, in this case, scans through a portion of the document corpus. Any time a word is encountered, the map function emits the word as the key and sets the value to be one, indicating one occurrence of the word.
We will defer discussion of the combine phase for a moment and talk about the reduce phase. All of the key-value pairs emitted during map are collated together, so that all of the values associated with a given key are co-located on a single node. The reduce function is then applied to the set of values for a given key. For the example of counting words, the reduce function is a simple summation of the value set for a given key.
This collation of the key-value pairs, sorting on the keys, can be very expensive in terms of network bandwidth if one were to simply broadcast all key-value pairs. Also, for power law distributions, the number of pairs for commonly occurring keys may exceed the working memory of a single node. For instance, the word the dominates the distribution of words for English language corpora, and sending all the instances of the word the to one single node is not advisable.
A common way to address this is with a combine function. The combine function, which is usually identical to the reduce function, is applied locally at each node to the set of key-value pairs emitted by the local map process. Thus, instead of each node sending a large number of key-value pairs over the network for the word "the", using the combine function allows each node to emit only one pair, whose value is the sum of the counts local to the node. This only works if the reduce operation is both commutative and associative. Otherwise, all pairs must be sent and collected together by key in their entirety. The restriction of commutativity and associativity for the reduce function still lends itself to a wide range of applications, and this assumption is employed in the design of higher-level languages built on top of MapReduce, such as Sawzall [13].
By considering only reduce operations that are both commutative and associative, on a shared memory machine we can completely avoid the collation of key-value pairs. Also, we can execute the reduce function whenever the map function emits a key-value pair, in essence performing reduction in place, collapsing the map and reduce phases together. This is accomplished by means of a global hash table. For each key-value pair emitted, we call updateInsert(key, value, update), where update is the reduce operation that we perform on the currently stored value in the hash table and the value. Thus, to port the concept of MapReduce onto a shared memory architecture for commutative and associative reduce operations, all you need is a hash table. Many algorithms are being developed for MapReduce on clusters. The hash table implementation we present here can be used as a kernel for porting these algorithms quickly to shared memory platforms.
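As an illustration of collapsing map and reduce, the word-count example can be written so that every emitted pair is folded directly into one global table. This is a single-threaded sketch with a dict standing in for the hash table; map_words and the tiny corpus are made up for the example.

```python
from operator import add

def update_insert(table, key, value, update):
    """Apply the reduce operation in place as soon as a pair is emitted."""
    table[key] = update(table[key], value) if key in table else value

def map_words(document, table):
    """Map function for word counting: emit (word, 1) straight into the table."""
    for word in document.split():
        update_insert(table, word, 1, add)

corpus = ["the cat saw the dog", "the dog ran"]   # each string is one data partition
word_counts = {}
for partition in corpus:
    map_words(partition, word_counts)
```

No collation or sorting step appears: the table already holds the reduced value for each key when the last map call returns.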
Another important factor we address in this paper is the underlying distribution of the data, ensuring that, regardless of the distribution, performance and scalability are similar. Uniform random distributions are most often addressed in the literature [15], [12], [14], [10]. However, many important, real-world problems involve power law distributions. To name but a few: natural language corpora [18], the occurrence of substrings in DNA sequences [11], and the distribution of degrees in internet graphs [3]. We present a strategy that scales well on both uniform random and power law distributions.
PREVIOUS WORK
In the past decade, there has been significant interest in developing lock-free algorithms and data structures. The term lock-free encapsulates the notion that no thread can block system-wide progress. Thus, if any thread is delayed, it does not delay or impact the progress of other threads. A significant area of focus is that of transactional memory [9] , which allows groups of memory operations to operate atomically, thus simplifying the programming paradigm.
Besides transactional memory, much research in lock-free algorithms has focused on the atomic compare-and-swap operation, CAS, provided by many architectures. The function, taking three parameters, proceeds as follows. If the value in the memory location specified in the first parameter is equal to the value of the second parameter, then the memory location is updated with the value of the third parameter. Otherwise, the memory location is left unmodified. Regardless of whether an update occurred, the content before the CAS operation is returned.
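The CAS contract described above can be mimicked in a few lines. Python exposes no hardware CAS, so this sketch emulates atomicity with a mutex; the class name Cell and its method are hypothetical and only model the semantics, not the performance, of the instruction.

```python
import threading

class Cell:
    """A single memory location with an emulated compare-and-swap."""

    def __init__(self, value):
        self.value = value
        self._guard = threading.Lock()   # stands in for hardware atomicity

    def compare_and_swap(self, expected, new):
        """Store new only if the cell holds expected; always return the prior content."""
        with self._guard:
            prev = self.value
            if prev == expected:
                self.value = new
            return prev

cell = Cell(5)
first = cell.compare_and_swap(5, 7)    # matches, so the cell becomes 7
second = cell.compare_and_swap(5, 9)   # no match, the cell is left at 7
```

A caller learns whether its swap took effect by comparing the returned value with the expected one.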
Gao et al. [7] present a lock-free synchronization primitive using CAS that avoids the ABA problem. The ABA problem is one where a thread reads a value A, and then performs CAS on the location of A to set it to another value. The intent is to only execute the change if A did not change between the read and the CAS call. However, between the read and CAS call, another thread may have changed A to B and then back to A, thus breaking the intended logic. In another paper, Gao et al. [6] present a robust, lock-free, open addressing scheme for hash tables that can operate effectively in heterogeneous environments. Shalev and Shavit [15], using CAS once again, develop an extensible hash table. Fraser and Harris [5] extend CAS to implement a multiword compare-and-swap that can operate on an arbitrary set of memory locations simultaneously, and also present other functions to aid in development of lock-free data structures. One guiding principle they employ is that of disjoint-access parallelism, which means that updates to non-overlapping locations can be performed concurrently. We use this principle in all of our synchronization primitives, as will be explained next.
Our work presents four synchronization primitives, only one of which is completely lock-free. However, we find this to not be much of an issue, as performance is dictated by other factors. To understand this, consider that the hash table can be thought of as a large array, where each item, or bucket, in the array is used to store the data for one key. We perform fine-grained locking at the level of the bucket. Thus, if one thread locks out a bucket, it only affects other threads trying to access that bucket. Furthermore, we only perform locking when a bucket is first declared occupied for a particular key. Access to the bucket is otherwise lock-free once the bucket's key has been set.
SUPERCOMPUTERS UNDER STUDY
This section describes the two machines we use for this study: the Cray XMT and the SGI Altix UV.
Cray XMT
The Cray XMT is a unique shared memory machine with multithreaded processors especially designed to support fine-grained parallelism and perform well despite memory and network latency. Each of the custom-designed compute processors (called Threadstorm processors) comes equipped with 128 hardware threads, called streams in XMT parlance. The processor, instead of the operating system, has responsibility for scheduling the streams. To allow for single-cycle context switching, each stream has a program counter, a status word, eight target registers, and 32 general-purpose registers. At each instruction cycle, an instruction issued by one stream is moved into the execution pipeline. The large number of streams allows each processor to avoid stalls due to memory requests to a much larger extent than commodity microprocessors. For example, after a processor has processed an instruction for one stream, it can cycle through the other streams before returning to the original one, by which time some requests to memory may have completed. Each Threadstorm can currently support 8 GB of memory, all of which is globally accessible. The system we use in this study has 128 processors and 1 TB of shared memory. The processor speed of the Threadstorm processors is 500 MHz.
Programming on the XMT consists of writing C/C++ code augmented with non-standard language features including generics, intrinsics, futures, and performance-tuning compiler directives such as pragmas. Generics are a set of functions the Cray XMT compiler supports that operate atomically on scalar values, performing either read, write, purge, touch, or int_fetch_add operations. Each 8-byte word of memory is associated with a full-empty bit (FEB) and the read and write operations interact with these bits to provide light-weight synchronization between threads. We refer to the set of operations interacting with the FEB as FEB semantics. Here are some examples of the generics provided:
• readXX: Returns the value of a variable without checking the full-empty bit.
• readFE: Returns the value of a variable when the variable is in a full state, and simultaneously sets the bit to be empty. The stream is blocked if the bit is in the empty state.
• readFF: Returns the value of a variable when the variable is in a full state, and leaves the bit as full.
• writeEF: Writes a value to a variable if the variable is in the empty state, and simultaneously sets the bit to be full. The stream is blocked if the bit is in the full state.
• int_fetch_add: Atomically adds an integer value to a variable.
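To make the FEB semantics concrete, the generics listed above can be emulated with a condition variable. This is only a behavioral sketch: the class FEBWord is hypothetical, the word is assumed to start in the empty state, and on the XMT these operations are provided by hardware rather than by a software monitor.

```python
import threading

class FEBWord:
    """One memory word paired with an emulated full-empty bit."""

    def __init__(self):
        self._value = 0
        self._full = False               # assumption: the word starts empty
        self._cond = threading.Condition()

    def is_full(self):
        return self._full

    def readXX(self):
        return self._value               # ignores the full-empty bit

    def readFF(self):
        with self._cond:
            while not self._full:        # block until the word is full
                self._cond.wait()
            return self._value           # the bit stays full

    def readFE(self):
        with self._cond:
            while not self._full:
                self._cond.wait()
            self._full = False           # hand the word back as empty
            self._cond.notify_all()
            return self._value

    def writeEF(self, value):
        with self._cond:
            while self._full:            # block until the word is empty
                self._cond.wait()
            self._value = value
            self._full = True
            self._cond.notify_all()

word = FEBWord()
results = []
consumer = threading.Thread(target=lambda: results.append(word.readFE()))
consumer.start()                         # blocks in readFE until a value arrives
word.writeEF(42)
consumer.join()
```

The consumer blocks until the producer's writeEF marks the word full, which is the light-weight producer-consumer handoff that the FEB provides.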
Parallelism is achieved explicitly through the use of futures, or implicitly when the compiler attempts to automatically parallelize for loops. Futures allow programmers to explicitly launch threads to perform some function. For implicitly parallelized loops, the programmer can provide pragmas that give hints to the compiler on how to schedule iterations of the for loop on various threads, which can be by blocks, interleaved, or dynamic. In addition, the programmer can supply hints on how many streams to use per processor, etc. We extensively use the #pragma mta for all streams i of n construct, which makes the program aware of the total number of streams that the runtime has assigned to the loop and provides an iteration index that can be treated as the id of the stream assigned to each iteration.
The XMT attempts to make memory access look uniform through its hyper-threaded architecture. However, there are two aspects that a programmer can utilize to exploit the NUMA aspects of the machine. The first is that accessing memory locations local to the processor is faster than accessing non-local ones. The second is that each processor has a memory controller with a 128 KB buffer. If one can keep the content of that buffer relatively stable and unchanging, then one can perform memory operations at the clock rate of the processor.
Many of the results outlined later utilize code from the MultiThreaded Graph Library (MTGL) 1 [2] , an open source library with data structures and algorithms specifically designed to run scalably on shared-memory platforms such as the XMT.
SGI Altix UV
The SGI Altix UV is another large-scale, shared memory system but, unlike the XMT, uses industry standards and commodity parts such as Xeon processors. It is a cache-coherent system. Since industry standards are supported, one can use standard C/C++ instead of the more specific generics and intrinsics of the XMT. Also, the operating system is Linux. To get parallelism, we used a combination of OpenMP 2 and Qthreads 3 [17]. The latter is an open source library that presents a general API for thread management and coordination that hides machine-specific implementation details. The Altix UV comes in three different flavors, the 10, the 100, and the 1000, the difference being to what degree the architecture can scale. The Altix UV 10 is limited to 4 Xeon processors and a terabyte of memory, while the 1000 can have up to 256 processors and 16 terabytes. The machine we use is the Altix UV 10. It has 4 Xeon X7550 processors that run at 2.00 GHz, each with 18 MB of cache. Each processor has 8 cores and 16 total threads. The amount of memory for the system is half a terabyte.
Some important considerations for this architecture and software stack are the memory hierarchy and the paging system. The Altix UV is much more susceptible to memory latency issues than the XMT. To get scaling, one must design the algorithm to exploit the caches, and keep memory accesses as close and as dedicated to each processor as possible. Another factor that comes up in our experiments is the memory paging system used by Linux. Linux is a demand-based system, meaning that pages of memory are not mapped into physical memory until first referenced. We found that the process of mapping these pages into memory does not scale well, and that a process for priming memory is needed in order to get good scaling on algorithm run time.
THREAD COORDINATION
This section discusses the mechanisms we use for thread coordination to ensure algorithmic correctness. We discuss these in terms of two different situations, that of known and unknown key distributions. For all of the hashing strategies we employ, we use open addressing and linear probing. Basically the hash table can be thought of as a large array. Hashing provides an index into the hash table array, and the process of inserting or updating the value for a key probes from the original index until it finds the appropriate location, which is described in more detail below.
Known Key Distributions
We first deal with the case where the key distribution, K, is known and, in particular, where we know that a certain key, ku, will never be inserted into the hash table, i.e. ku ∉ K. This is useful for cases when ids are assigned to a data set, and the curator of the data has control over how ids are assigned to particular data items. In such situations we can use ku to signify that a bucket in the hash table is unclaimed.
As explained previously, we use open addressing on a large array. We will later discuss various strategies for organizing the data in order to exploit the NUMA aspects of the architectures under study, but at the heart of inserting the data is the procedure in Algorithm 1.
Line 2 performs the hash function:
where C = 31,280,644,937,747 is a large prime constant (taken from [8]). There are many candidates for hash functions, but that is not the focus of this work. This simple hash function was sufficient for our needs. Line 3 records the results of the hash into another variable, start, the index from which we linearly probe forward until the correct bucket is found. Should the end of the array be reached, we wrap around to the beginning. If the table is ever completely filled, the repeat loop exits. However, for the use case under which we are operating, the size of the data to be inserted is known beforehand. Managing the load factor of the table is relatively straightforward and this degenerate case will never be reached.
Algorithm 1 (closing lines):
 9:     end if
10:     j ← j + 1
11:     if j == |table| then
12:       j ← 0
13:     end if
14:   until j == start
15: end procedure
In the body of the repeat loop, the first step on line 5 is to call CAS on the j-th key location in the hash table array. On line 6, we examine the results of the CAS operation. If the previous value, prev, is ku, then we know that this is the first thread to ever touch this bucket of the hash table array. In essence, the thread claimed the bucket for the thread's current key, k. It can then perform an atomic update. For our application, that of counting keys, we increment the value portion of the table array by 1. This addition operation has an atomic variant for both the XMT and the UV. If an operation is desired that cannot be updated atomically, then a fine-grained locking mechanism would need to be employed to update table[j].value.
Also on line 6, if prev is not ku, then this thread is not the first to touch the j-th element of the table array. However, if the key assigned to the j-th element is the same as the key under consideration, the thread can still proceed with the atomic update to the value.
Note that, since we make use of the atomic CAS operation, this approach is lock-free, assuming that the update procedure is a non-blocking operation or a set of non-blocking operations.
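The CAS-based procedure described above can be sketched as follows. Several parts are assumptions made for illustration: CAS and the atomic add are emulated with a mutex because Python has no hardware equivalents, the sentinel KU = -1 stands in for ku (so all real keys are assumed non-negative), and the hash is taken to be multiplication by C modulo the table size, which is a guess at the form of the hash rather than the formula used in the paper.

```python
import threading

KU = -1                       # assumed "unclaimed" key, never inserted
C = 31280644937747            # the constant quoted in the text

class Table:
    def __init__(self, size):
        self.keys = [KU] * size
        self.values = [0] * size
        self._guard = threading.Lock()     # emulates hardware atomicity

    def cas_key(self, j, expected, new):
        with self._guard:
            prev = self.keys[j]
            if prev == expected:
                self.keys[j] = new
            return prev

    def atomic_add(self, j, delta):
        with self._guard:
            self.values[j] += delta

def update_insert(table, k, delta=1):
    size = len(table.keys)
    start = j = (k * C) % size             # assumed hash form
    while True:
        prev = table.cas_key(j, KU, k)     # try to claim bucket j for k
        if prev == KU or prev == k:        # claimed just now, or already ours
            table.atomic_add(j, delta)
            return True
        j += 1                             # linear probe with wrap-around
        if j == size:
            j = 0
        if j == start:
            return False                   # table is full

def lookup(table, k):
    size = len(table.keys)
    start = j = (k * C) % size
    while table.keys[j] != KU:
        if table.keys[j] == k:
            return table.values[j]
        j = (j + 1) % size
        if j == start:
            break
    return None

table = Table(16)
workers = [threading.Thread(target=lambda: [update_insert(table, k)
                                            for k in [1, 2, 3, 1] * 100])
           for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Four threads each insert the same block of keys, and the counts accumulate correctly regardless of which thread claims each bucket first.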
The Cray XMT lacks an atomic CAS, thus we present an implementation specific for its needs, using FEB semantics, in Algorithm 2. Line 5 first checks to see if the bucket has an associated key. If so, it then checks to see if the bucket's key matches the current key. If so, it can then perform the update. Otherwise, the thread continues to probe forward. Note that this is a non-blocking read of the bucket's key. After a bucket is declared as occupied, access to the bucket is non-blocking, assuming the update operation is non-blocking.
If the comparison on line 5 turns out to be false, indicating that a bucket does not have an assigned key, the thread attempts to claim the bucket, on line 11, with a call to readFE on the bucket's key. The first thread to gain control of the bucket's key will set it to the current key on line 13. All other threads will have to check if the key assigned while they were waiting matches their current key. If so, they perform the update; otherwise they continue to probe. Note that while this is a blocking algorithm, it only blocks a specific bucket while the bucket is being claimed for a key. It should be noted that the key type is limited to data types that are 8 bytes or larger, else the full/empty bits do not align properly.
Unknown Key Distributions
For the more general case where the key distribution is unknown and we cannot specify with certainty that a particular k is not in K, another strategy is warranted. Now, each bucket in the hash table array is a triple, (key, value, lock). The new variable, lock, is generally either a 32- or 64-bit integer, depending on the smallest size that can be operated on with either CAS or FEB semantics. It handles thread synchronization when a bucket is declared claimed. As before, the table is initialized so that for each i ∈ [0, |table| − 1], table[i].value ← 0. However, for this case, no special consideration is needed for the key field. The lock variable now indicates the status of a bucket, and the key field is set based upon the status of lock. As we discuss this procedure, note that the general outline is the same as that of Algorithm 2. The main difference is that we use CAS instead of FEB semantics to perform blocking to ensure that only one thread can claim a bucket.
The first conditional within the body of the repeat loop checks to see if a bucket is claimed and, if so, checks to make sure the key to be updated matches the key of the bucket. In that case, the thread can perform the update to the value associated with the bucket. If the keys do not match, the thread continues probing forward.
If the bucket is unclaimed, the thread attempts to claim it for key k. The thread enters a blocking loop on line 12 that attempts to use CAS to change the lock variable from unclaimed to intermediate. The thread exits this loop when either it successfully sets lock to intermediate or lock has been set to claimed by another thread.
If the thread successfully set lock to intermediate (note that only one thread can succeed in this regard), it then sets the bucket's key variable to k and sets lock to be claimed. Also, it performs the atomic update to value. Note that, since key is updated before lock is set to claimed, we are guaranteed that any threads entering the first conditional on line 5 will have the necessary information. In other words, if a thread finds a bucket claimed, then its key has also been assigned.
After exiting the repeat loop of line 11, if a thread finds that the lock variable is not in the intermediate state but is already claimed, it then proceeds to check key to see if it matches k. If so, the value is updated. Otherwise, the thread continues probing.
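The three-state claiming protocol for unknown key distributions can be sketched along the same lines. As before, this is an illustration under assumptions: CAS and the atomic add are emulated with a mutex, the state names UNCL, INTER, and CL are shorthand for unclaimed, intermediate, and claimed, and the multiplicative hash form is a guess rather than the paper's formula.

```python
import threading

UNCL, INTER, CL = 0, 1, 2     # unclaimed, intermediate, claimed
C = 31280644937747            # the constant quoted in the text

class Table:
    def __init__(self, size):
        self.keys = [None] * size          # no sentinel key is needed
        self.values = [0] * size
        self.locks = [UNCL] * size
        self._guard = threading.Lock()     # emulates hardware atomicity

    def cas_lock(self, j, expected, new):
        with self._guard:
            prev = self.locks[j]
            if prev == expected:
                self.locks[j] = new
            return prev

    def atomic_add(self, j, delta):
        with self._guard:
            self.values[j] += delta

def update_insert(table, k, delta=1):
    size = len(table.keys)
    start = j = (k * C) % size             # assumed hash form
    while True:
        if table.locks[j] != CL:
            # Spin until this thread wins UNCL -> INTER or observes CL.
            prev = table.cas_lock(j, UNCL, INTER)
            while prev == INTER:
                prev = table.cas_lock(j, UNCL, INTER)
            if prev == UNCL:               # this thread owns the claim
                table.keys[j] = k          # set the key first ...
                table.locks[j] = CL        # ... then publish the bucket
        # The bucket is claimed here, so its key is guaranteed to be set.
        if table.keys[j] == k:
            table.atomic_add(j, delta)
            return True
        j = (j + 1) % size                 # linear probe with wrap-around
        if j == start:
            return False                   # table is full

def lookup(table, k):
    size = len(table.keys)
    start = j = (k * C) % size
    while table.locks[j] == CL:
        if table.keys[j] == k:
            return table.values[j]
        j = (j + 1) % size
        if j == start:
            break
    return None

table = Table(16)
workers = [threading.Thread(target=lambda: [update_insert(table, k)
                                            for k in [0, -1, 7, 0] * 50])
           for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Because bucket status lives in the lock field, any key value, including 0 and -1, can be stored without reserving a sentinel.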
The algorithm using FEB semantics is presented in Algorithm 4. This particular strategy was presented earlier in [8], but we outline it here for the sake of completeness. The procedure has a similar flow to the one using CAS. This time there is no explicit repeat loop that handles blocking the threads when a bucket is in the process of being claimed. For the XMT, this is handled at the hardware level.
Other Primitives
Besides updateInsert, there are two other functions that we use. The first one we will discuss is claiming a key for a bucket in a thread-dedicated table (Algorithm 5). The value associated with the bucket is untouched. Because the table is only accessed by one thread, we can ignore the thread synchronization concerns. In the interest of space, we only list the algorithm for known key distributions. For unknown key distributions, the procedure is similar except that the comparisons are done against the lock variable in the table array using the CL and UNCL constants. Also, the first conditional sets the lock variable to CL.
The updateInsert function inserts a key into the table if it does not exist. We have the need in one of our approaches for just the update procedure. If the key to be updated does not exist, the table is unchanged. Again, we are working in the context of a thread-dedicated table, so thread synchronization is not a concern. Algorithm 6 only presents the algorithm used when the key distribution is known. The function returns true when an update occurred, and false otherwise. 
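The two thread-local primitives can be sketched without any synchronization, since only one thread touches the table. The sketch assumes a known key distribution with sentinel KU = -1 and the same guessed multiplicative hash form used for illustration elsewhere; the function names claim and update and the choice to return the bucket index from claim are assumptions, not details taken from Algorithms 5 and 6.

```python
KU = -1                       # assumed "unclaimed" key, never inserted
C = 31280644937747            # the constant quoted in the text

def claim(keys, k):
    """Reserve a bucket for k in a thread-dedicated table; values are untouched."""
    size = len(keys)
    start = j = (k * C) % size             # assumed hash form
    while True:
        if keys[j] == KU:
            keys[j] = k                    # no CAS needed: single owner
            return j
        if keys[j] == k:
            return j                       # already claimed earlier
        j = (j + 1) % size
        if j == start:
            return None                    # table is full

def update(keys, values, k, delta):
    """Update k only if it is already present; report whether an update occurred."""
    size = len(keys)
    start = j = (k * C) % size
    while keys[j] != KU:
        if keys[j] == k:
            values[j] += delta
            return True
        j = (j + 1) % size
        if j == start:
            break
    return False                           # key absent: table unchanged

local_keys = [KU] * 8
local_values = [0] * 8
slot = claim(local_keys, 5)
hit = update(local_keys, local_values, 5, 3)    # 5 was claimed, so this succeeds
miss = update(local_keys, local_values, 6, 1)   # 6 was never claimed
```

The boolean returned by update lets a caller decide what to do with pairs whose keys are not held in the thread-local table.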
HASHING STRATEGIES
We now present two hashing strategies that employ the previously discussed insertion procedures as algorithmic primitives for inserting in batch a large set of key-value pairs into a hash table. The first strategy we discuss is a straightforward utilization of the primitives that ignores cache and NUMA-related concerns. We refer to this method as Cache-oblivious or CO for short. The second method tries to overcome the failings of CO by dealing first with frequently occurring keys in a thread-local manner before a global update occurs. We refer to this method as Removing heavy-hitters (RHH). We will refer to the combination of CO with the unknown key distribution strategy as CO U and with the known key distribution as CO K. Similarly for RHH U and RHH K.
Cache-oblivious
As stated earlier, the CO method is uncomplicated. Given a set of key-value pairs, KV , the method simply iterates through all of them, calling updateInsert. Our implementation uses block scheduling, meaning that each thread is assigned one contiguous block of key-value pairs. Since we use a simple open addressing scheme, the table array must be instantiated so that the load factor is beneath 0.7, otherwise performance degrades rapidly. We determine the correct size manually, but it would be relatively simple to sample the data first to determine an appropriate size. 
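Two small calculations underlie the CO setup: splitting the input into one contiguous block per thread, and choosing a table size that keeps the load factor beneath 0.7. The helpers below are a hypothetical sketch of both; the names block_bounds and min_table_size are not from the implementation, and the number of distinct keys is assumed to be known or estimated beforehand.

```python
import math

def block_bounds(n_items, n_threads):
    """Return one contiguous [start, end) range of item indices per thread."""
    base, extra = divmod(n_items, n_threads)
    bounds = []
    start = 0
    for t in range(n_threads):
        end = start + base + (1 if t < extra else 0)   # spread the remainder
        bounds.append((start, end))
        start = end
    return bounds

def min_table_size(n_distinct_keys, max_load=0.7):
    """Smallest table size whose load factor stays strictly below max_load."""
    return math.floor(n_distinct_keys / max_load) + 1

bounds = block_bounds(10, 3)
size = min_table_size(100)
```

Each thread then iterates over its own range of KV and calls updateInsert on every pair.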
Performance
Here we discuss the performance of the CO strategy performing a counting exercise on integer data sets, i.e. the keys are integers, the value associated with each integer is 1, and the update function is addition. We examine four different data sets. The first data set is 5 billion uniform random integers drawn from the range [0, 2^64 − 1]. The last three are also 5 billion integer data sets, but drawn from a power law distribution. We use the Pareto distribution that has the following probability density function:
f(x) = α xm^α / x^(α+1), for x ≥ xm.
We let xm be 1 and we use three different values for α: 0.25, 1, and 2. To get a set of integers, we discretized the values produced using the floor function. Because we found that the given hash function performed close to perfectly when using the set of integers given by the initial distribution, we mapped each key to a random value in [0, 2^64 − 1]. This was done to increase the likelihood of collisions in the hash table. For α = 0.25, we arrive at a distribution that is close to Zipf's law [18], i.e. that the frequency of the n-th ranked item is twice that of the 2n-th ranked item. For α = 1, 2, the curve becomes much steeper, with α = 2 resulting in a difference of 6-8 between the n-th and 2n-th ranked items.
For all of the results discussed in this paper, we present more detailed results at this page 4. The focus of this paper is on algorithm time; however, there are other important performance considerations such as file I/O. We did not attempt to optimize file I/O, but the times can be found on the site. In general, the XMT had a read I/O speed of about 300 MB/s and the UV approached 1 GB/s. On another XMT system with a richer I/O environment, we have seen speeds close to 4 GB/s. We feel that in either system, better I/O rates could have been obtained with further investment.
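The construction of the power law data sets (Pareto sampling with xm = 1, discretization with the floor function, and remapping of each distinct key to a random 64-bit value) can be sketched at small scale as follows. The function pareto_keys, the seed, and the sample size are choices made for this example; the standard-library call random.paretovariate draws from a Pareto distribution with minimum value 1.

```python
import math
import random
from collections import Counter

def pareto_keys(n, alpha, seed=0):
    """Draw n integer keys following a discretized Pareto law with xm = 1."""
    rng = random.Random(seed)
    remap = {}                                   # discretized value -> random 64-bit key
    keys = []
    for _ in range(n):
        raw = math.floor(rng.paretovariate(alpha))   # floor discretization
        if raw not in remap:
            remap[raw] = rng.getrandbits(64)         # uniform in [0, 2^64 - 1]
        keys.append(remap[raw])
    return keys

keys = pareto_keys(10_000, alpha=2)
top_key, top_count = Counter(keys).most_common(1)[0]
```

For α = 2, samples in [1, 2) all floor to 1 and make up about three quarters of the draws, so a single key dominates the data set; this is the kind of heavy hitter that stresses the hash table.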
A more problematic overhead, at least for the UV, is that of data structure construction and priming. For the XMT, allocating memory for data structures presented a relatively small overhead cost on the order of a few seconds. However, the UV has a demand paging system that exacts a significant penalty the first time a location in memory is accessed. We found that to get good scaling during execution of the algorithm, we needed to first prime the data structures by touching each location in memory. This process of priming did not scale, and took on the order of 2-10 times more than the algorithm run time for higher thread counts. Some of these data initialization overheads could possibly be paid for only once by creating an allocation pool containing preinitialized data structures from which processes could draw. We leave investigation of these concepts as future work.
Figure 3: (a) CO U, (b) CO K. The results of performing the Cache-oblivious strategy on sets of integers of size five billion for both the Altix UV and the Cray XMT. Figure (a) shows the results of performing with the unknown key distribution approach, and (b) shows the results with a known key distribution. The horizontal axis displays the number of XMT processors used or the total number of threads employed on the Altix UV. The UV performs horribly except for the uniform random data, while the XMT does fairly well on all data sets, though scalability is best on uniform random.
Figure 3 displays the results of running CO U and CO K on both the XMT and the UV. First examining performance on the uniform random data set for the UV, it can be seen that the straightforward approach of CO U and CO K scales reasonably well to 32 threads, with a relative speedup of 28.3 and an efficiency of 0.89. Going out to 64 threads, the maximum number supported by hardware, only increases the speedup to 33.3 with an associated efficiency of 0.52.
This decrease in scalability as we expand to the limits of the hardware is to be expected. When 64 threads are used, the two threads per core share an execution unit, and scalability generally lessens or levels out for most applications.
CO fails completely on the UV for the power law distributed sets. Performance is best on a single thread and does not scale thereafter. This is likely due to cache thrashing as all the threads compete to write to the sections of the table that contain the frequent keys.
For the XMT, the reader will notice that our results begin with 2-processor runs. This is due to a limitation in the file I/O setup that requires at least two Threadstorm processors to be running in order to perform I/O. We measure relative scalability and efficiency from the 2-processor times.
The XMT proves to be more resilient to hotspots in memory due to the frequent keys. This is likely because of the latency-tolerant nature of the processors, and due to the fact that single words can be transferred through memory instead of entire cache lines. However, as the frequent keys become more dominant, scalability suffers. For the uniform random data set, the XMT approaches a speedup of 58.2 (out of 64) for an efficiency of 0.91. This degrades to efficiencies of 0.37 for α = 0.25, 0.15 for α = 1, and 0.09 for α = 2. A new approach is warranted.
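The relative speedup and efficiency figures quoted in this section follow from a simple calculation, sketched below. The function name `relative_scaling` and its parameters are hypothetical; the only assumption is that efficiency is the measured speedup divided by the ideal speedup over the baseline configuration (one thread on the UV, two processors on the XMT).

```python
def relative_scaling(t_base, t_p, p_base, p):
    """Return (speedup, efficiency) of a run on p units relative to p_base units.

    t_base: run time of the baseline configuration (p_base units).
    t_p:    run time on p units.
    """
    speedup = t_base / t_p       # how many times faster than the baseline
    ideal = p / p_base           # best possible speedup over the baseline
    return speedup, speedup / ideal
```

With a 2-processor baseline, the ideal speedup at 128 processors is 64, which is why the XMT speedup of 58.2 is reported as "out of 64" and corresponds to an efficiency of roughly 0.91.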
Note that performance is very similar regardless of whether a known or unknown key distribution strategy is being employed. The only notable difference is the swapping of the curves for the UV for α = 1, 2. We are unsure of why, for known key distributions, α = 2 performs better than α = 1.
Removing Heavy-Hitters
The RHH strategy avoids the weaknesses of the CO strategy by first removing frequently-occurring keys. The process is outlined by Algorithm 8 and is described in more detail below.
Besides a set of key-value pairs, KV, and an update function, the parameters of this algorithm include a fraction f ∈ (0, 1] that determines how many key-value pairs are sampled from KV. Lines 2-5 perform necessary setup before the hashing begins. Line 3 creates an empty array of key-value pairs the same size as KV that is needed later for buffering. nsamples is set to the number of samples each thread should perform. nleft will be used as a global counter that will store how many key-value pairs remain after sampling and thread-local processing.
At line 6, the algorithm enters a parallel region with five for loops. Lines 7-8 perform a manual calculation of the beginning and end point for the contiguous block assigned to each thread. Line 9 calculates the step size to get the proper number of samples.
The first for loop between lines 11-13 performs a sampling of the data. In this first pass, we only perform the local-ClaimBucket function on a local table, in essence keeping track of which keys are in the sample.
The second for loop (15-23) then goes through and counts how many times each key occurs, but only if the key was sampled in the first loop. Any keys and their associated values that were not sampled are transferred into the KV2 array, but in a non-contiguous manner.
Figure 4: The results of performing the removing heavy-hitters strategy on sets of five billion integers for both the Altix UV and the Cray XMT. Panel (a), RHH U, shows the results with the unknown key distribution approach, and panel (b), RHH K, shows the results with a known key distribution. The horizontal axes display the number of XMT processors used or the total number of threads employed on the Altix UV. The XMT scales nearly linearly for all data sets, but the UV beats the XMT handily on power law distributions.
The third for loop (25-29) transfers all the key-value pairs from KV2 back into KV, but this time forming a contiguous block. These key-value pairs are all those that are not contained in the local table. This step isn't strictly necessary; it allows us to repartition the remaining data and load-balance the work during the last loop.
The fourth loop (30-35) iterates through the content of the local table, and adds the entries to the global table. Loops 1-4 are what allow this method to scale well for power law distributions. All of the frequently occurring keys are largely addressed locally in a cache-friendly manner. When the update from the local table to the global table occurs, the number of operations for frequently occurring keys is on the order of the number of threads, rather than on the order of the count of the key, drastically reducing the amount of cache thrashing. At the end, the fifth loop (38-41) addresses any remaining keys that were not caught by the sampling procedure.
Figure 4 shows dramatic improvement for the UV under the RHH strategy. Now the UV is beating out the XMT for the power law distributed data on a thread-to-processor basis, and continuing to perform on par with the XMT on the uniform random data. For power law distributions, the UV is performing between 3.4 and 17.5 times better than the XMT on a thread-to-processor basis. Note that the performance is nearly identical with either a known or unknown key distribution. The relative scaling to 32 threads for the UV has a high of around 21-22 for α = 0.25, but then decreases as α increases, coming to 10-11 for α = 2. Another noticeable pattern is the increase in time as α becomes lower. The reason for this is that, as α becomes smaller, the number of keys increases, necessitating an increase in table size and a decrease in overall cache-friendliness.
The UV's performance increased dramatically for power law distributions, but the XMT's performance also improved. The improvement ranges from around a 10% decrease in time for low processor counts, to nearly a 90% improvement for 128 processors, due to the greater scalability of the RHH strategy.
Table 1: Listed here are the relative scaling and efficiencies obtained using the RHH method. For the UV, the relative scaling compares the one-thread time to the 32-thread time. For the XMT, it compares the two-processor time to the 128-processor time.
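The five passes of the RHH strategy described above can be illustrated with a sequential sketch. This is not Algorithm 8 itself: threads are simulated by a plain loop, Python dictionaries and sets stand in for the local and global hash tables, and no locking is shown. The names `rhh_insert`, `upsert`, and the default additive `update` function are hypothetical.

```python
def rhh_insert(kv, num_threads, f, update=lambda old, new: old + new):
    """Sequentially simulate RHH batch insertion of (key, value) pairs."""
    n = len(kv)
    global_table = {}
    kv2 = []  # buffer for pairs whose keys were not caught by sampling

    def upsert(table, k, v):
        # Insert v, or combine it with the stored value via the update function.
        table[k] = update(table[k], v) if k in table else v

    for t in range(num_threads):          # each iteration plays one thread
        begin = t * n // num_threads      # contiguous block for this thread
        end = (t + 1) * n // num_threads
        size = end - begin
        if size == 0:
            continue
        nsamples = max(1, int(f * size))
        step = max(1, size // nsamples)   # stride giving roughly nsamples samples

        # Loop 1: sample the block, recording only which keys were seen.
        sampled = {kv[i][0] for i in range(begin, end, step)}

        # Loop 2: accumulate sampled keys locally; buffer everything else.
        local = {}
        for k, v in kv[begin:end]:
            if k in sampled:
                upsert(local, k, v)
            else:
                kv2.append((k, v))

        # Loop 3 (compaction into a contiguous block) is implicit here,
        # because appending to a Python list already yields a dense buffer.

        # Loop 4: merge the local table into the global table, one update
        # per (thread, key) instead of one per occurrence of the key.
        for k, v in local.items():
            upsert(global_table, k, v)

    # Loop 5: insert the remaining pairs that sampling did not catch.
    for k, v in kv2:
        upsert(global_table, k, v)
    return global_table
```

For example, `rhh_insert([(7, 1)] * 60 + [(i, 2) for i in range(100, 140)], 4, 0.1)` accumulates the frequent key 7 in the simulated local tables, so the global table receives at most one update for that key per simulated thread.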
While performance improved significantly under RHH for both the XMT and the UV on power law distributions, it actually suffered under uniform random distributions. For example, the times for RHH K increased on average by 49% on the UV, while the XMT incurred a 90% increase. This is due to the added computational complexity. Both are, of course, O(|KV|), but the constant differs. We omit a formal count of the number of operations, but we can perform a rough count of the number of iterations for a uniform random data set.
CO has one main for loop with |KV| iterations. RHH, on the other hand, has a parallel region with five for loops. The first for loop iterates over a sample of the data and requires f|KV| iterations. The second for loop iterates over the entire data set, equating to |KV| iterations. The third for loop iterates over all the key-value pairs not caught by the sampling. For the uniform random data set, which has virtually no duplicates, this means (1 − f)|KV| iterations are required. The fourth for loop iterates over the size of the local table. We set our local tables to be of size 2f|KV|/|T|. Thus, for all threads, the number of iterations is 2f|KV|. Finally, in the last loop, we iterate over the set of remaining key-value pairs, which comes again to (1 − f)|KV|. Summing the five terms gives (f + 1 + (1 − f) + 2f + (1 − f))|KV| = (3 + f)|KV|, so RHH requires (3 + f) times as many iterations as CO. RHH nevertheless performs significantly better for power law distributions because it exploits the NUMA aspects of the machine: the cache hierarchy for the UV, and the data buffers of the XMT.
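The rough iteration count above can be checked with a few lines of arithmetic. The function name `iteration_counts` and the loop labels are hypothetical; the per-loop terms are the ones stated in the text for a uniform random data set with virtually no duplicate keys.

```python
def iteration_counts(n, f):
    """Rough loop-iteration counts for CO and RHH on n = |KV| pairs."""
    rhh_loops = {
        "sample": f * n,             # loop 1: sampled pairs
        "count": n,                  # loop 2: full pass over the data
        "compact": (1 - f) * n,      # loop 3: pairs not caught by sampling
        "merge_local": 2 * f * n,    # loop 4: all local tables, 2f|KV| in total
        "remaining": (1 - f) * n,    # loop 5: remaining pairs
    }
    return {"CO": n, "RHH": sum(rhh_loops.values()), "RHH_loops": rhh_loops}
```

The ratio of the two totals is 3 + f regardless of n, which matches the constant-factor slowdown discussed for uniform random data.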
A simple compromise to get the best of both worlds would be to sample the data to understand the distribution. If it appears uniform random with no duplicate keys, the process can continue with CO, thus incurring |KV|(1 + f) iterations. If the data, by statistical measures, appears to have frequent keys and approaches a power law distribution, the process can continue with the RHH strategy. A simple heuristic that will likely work well in practice is the following. First find the mode of the sample, mo, and estimate the probability of finding mo in a random selection from the array as p(mo) = |mo| / (f|KV|), where |mo| is the number of times mo was found in the sample. Then we can estimate the number of times that t threads will collide on updating mo with the expected value of a binomial distribution, namely t · p(mo). One can then find empirically a threshold, τ, such that for t · p(mo) > τ, the RHH strategy is warranted.
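The strategy-selection heuristic just described can be sketched as below. The function name `choose_strategy` is hypothetical, the sample is assumed to hold f|KV| keys drawn from the input, and the threshold τ is left as a parameter because the text states that it must be determined empirically.

```python
from collections import Counter


def choose_strategy(sample_keys, num_threads, tau):
    """Pick "RHH" or "CO" from the mode frequency of a sample of keys."""
    if not sample_keys:
        return "CO"
    # Count of the most frequent key in the sample, i.e. |mo|.
    _, mode_count = Counter(sample_keys).most_common(1)[0]
    # The sample size plays the role of f|KV|, so this is |mo| / (f|KV|).
    p_mode = mode_count / len(sample_keys)
    # Expected number of threads colliding on the mode: t * p(mo).
    expected_collisions = num_threads * p_mode
    return "RHH" if expected_collisions > tau else "CO"
```

A sample in which half of the keys are identical gives 16 expected collisions at 32 threads, so any modest threshold selects RHH, whereas a sample of 100 distinct keys gives 0.32 and falls back to CO for a threshold of 1.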
Before concluding, we examine the spatial complexity of each of the approaches. Table 2 lists the amount of memory needed for each method. The variables ks, vs, and ls are the sizes of the keys, values, and locks, respectively. For our experiments, keys and values are 8-byte integers. Locks for the UV are 4 bytes, while the XMT requires 8 bytes. |table| is the length of the hash table. For uniform random data sets, the table size is on the order of |KV|, the length of the key-value array. For power law distributions, the table size can be much smaller, as many keys repeat within the distribution. The 2f|KV| term in the RHH methods relates to the amount of space needed for the thread-dedicated tables. Each local table is of length 2f|KV|/|T|, where f is the sample fraction, |T| is the number of threads, and the factor of 2 is used to ensure appropriate load factors on the table. Thus, the combined length of all the local tables comes to 2f|KV|. The last term, |KV|, is the buffer used to gather the remaining keys together after frequent keys have been removed.
Method    Space
CO K      ks vs |table|
CO U      ks vs ls |table|
RHH K     ks vs (|table| + 2f|KV| + |KV|)
RHH U     ks vs (ls |table| + 2 ls f|KV| + |KV|)
CONCLUSIONS
The RHH strategy for hashing, either for known key distributions or unknown key distributions, scales well on both the XMT and the UV. Also, this scalability is exhibited for both uniform random and more difficult power law distributions.
For the size of problem we considered, that of five billion integers, the UV appears to be the clear winner, especially when we consider that only 4 Xeon processors were needed to topple the might of 128 Threadstorm processors. The RHH strategy effectively exploits the NUMA aspects of the machine, allowing the cache to remain stable for most of the computation. However, the XMT exhibits greater scalability and efficiency. We are curious if, for larger systems and problems, the XMT will eventually surpass the performance of the UV.
