A sketch is a probabilistic data structure that is used to record frequencies of items in a multi-set. Various types of sketches have been proposed in literature and applied in a variety of fields, such as data stream processing, natural language processing, distributed data sets etc. While several variants of sketches have been proposed in the past, existing sketches still have a significant room for improvement in terms of accuracy. In this paper, we propose a new sketch, called Slim-Fat (SF) sketch, which has a significantly higher accuracy compared to prior art, a much smaller memory footprint, and at the same time achieves the same speed as the best prior sketch. The key idea behind our proposed SF-sketch is to maintain two separate sketches: a small sketch called Slim-subsketch and a large sketch called Fat-subsketch. The Slim-subsketch, stored in the fast memory (SRAM), enables fast and accurate querying. The Fat-subsketch, stored in the relatively slow memory (DRAM), is used to assist the insertion and deletion from Slim-subsketch. We implemented and extensively evaluated SF-sketch along with several prior sketches and compared them side by side. Our experimental results show that SF-sketch outperforms the most commonly used CM-sketch by up to 33.1 times in terms of accuracy.
I. INTRODUCTION

A. Background and Motivation
A sketch is a probabilistic data structure that is used to record frequencies of distinct items in a multi-set. Due to their small memory footprints, high accuracy, and fast speeds of queries, insertions, and deletions, several types of sketches are being extensively used in data stream processing [1], [2], [3], [4], [5], [6], [7]. Sketches are also being applied in other fields, such as computing approximate association scores like point-wise mutual information [8]-[10], sparse approximation in compressed sensing [11], identifying heavy hitters [12], network anomaly detection [13], processing distributed data sets [14], and natural language processing [15]-[17]. The two key performance metrics of sketches are accuracy and query speed. Accuracy quantifies how close the frequency estimated from the information stored in the sketch is to the actual frequency. Query speed measures how long it takes a sketch to estimate the frequency of a given item. To achieve fast query speed, the sketch has to be stored in the fast SRAM because SRAM is about ten times faster than DRAM, albeit more expensive and limited in size. This paper focuses on the design of a new sketch that not only has a much smaller memory footprint compared to the existing sketches, but is also more accurate while achieving the same query speed as the best prior sketch.
B. Limitations of Prior Art
Charikar et al. proposed the Count sketch (C-sketch) [18] . C-sketch experiences two types of errors: over-estimation error, where the result of a query is a value larger than the true value, and under-estimation error, where the result of a query is a value smaller than the true value. Improving on the C-sketch, Cormode and Muthukrishnan proposed the Count-min (CM) sketch [19] , which does not suffer from the under-estimation error, but only from the over-estimation error. In a further enhancement, Cormode et al. proposed the conservative update (CU) sketch [20] , which improves the accuracy at the cost of not supporting item deletions, i.e., once the information about an item is inserted into the CU-sketch, it cannot be removed from the sketch without affecting the information of the other items in the sketch. CML-sketch [21] further improves the accuracy at the cost of suffering both over-estimation and under-estimation errors. Because CM-sketch supports deletions and does not have under-estimation error, it is still the most popular sketch in practice. The design goal of this paper is to significantly improve the accuracy of CM-sketch while keeping its advantages.
C. Proposed Approach
In this paper, we present a new sketch, called the Slim-Fat (SF) sketch, which achieves significantly higher accuracy and significantly smaller SRAM memory footprint compared to prior art while supporting deletions and achieving the same query speed as the widely used CM-sketch. Any sketch has four primary operations: initialization, insertion, deletion (for sketches that support deletions), and query. Initialization refers to the operation of setting all counters in a sketch to 0 before using the sketch. Insertion and deletion refer to the operations of incrementing and decrementing, respectively, the frequency of a given item in the sketch by 1. Query refers to the operation of estimating current frequency of a given item using the information stored in the sketch.
Before describing our approach, we first briefly describe how the conventional CM-sketch works because several design choices in SF-sketch are built on the CM-sketch. As shown in Figure 1, a CM-sketch consists of $d$ arrays, where we represent the $i$-th array with $A_i$. Each array consists of $w$ buckets and each bucket contains one counter. We represent the counter in the $j$-th bucket of the $i$-th array with $A_i[j]$. Each array $A_i$, where $1 \le i \le d$, is associated with an independent hash function $h_i(\cdot)$, whose output is uniformly distributed in the range $[1, w]$. In CM-sketch, the initialization operation is simply to set all counters $A_i[j]$ to zero, where $1 \le i \le d$ and $1 \le j \le w$. To insert an item $e$, i.e., to increment its frequency stored in the sketch by 1, the CM-sketch computes the $d$ hash functions $h_1(e), h_2(e), \ldots, h_d(e)$ and increments the $d$ counters $A_1[h_1(e)], A_2[h_2(e)], \ldots, A_d[h_d(e)]$ by 1. To delete an item $e$, the CM-sketch computes the $d$ hash functions and decrements the $d$ counters $A_1[h_1(e)], A_2[h_2(e)], \ldots, A_d[h_d(e)]$ by 1. When querying the frequency of an item $e$, the CM-sketch computes the $d$ hash functions and returns the value of the smallest counter among $A_1[h_1(e)], A_2[h_2(e)], \ldots, A_d[h_d(e)]$ as the answer to the query. Note that the value returned by the CM-sketch in response to the query for the frequency of an item will never be smaller than the true value of its frequency. Consequently, CM-sketch does not suffer from the under-estimation error, but only from the over-estimation error [19]. By carefully selecting the values of $d$ and $w$ and based on the distribution of items in the data stream, the over-estimation error can be estimated and bounded a priori.
Fig. 1. The Count-min sketch architecture. (Legend: $e$ is an item and $h_i$ is a hash function; the remaining symbols denote a counter and a bucket.)
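The four CM-sketch operations described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation evaluated in this paper: the class name `CountMinSketch`, the salted BLAKE2b hashing used to emulate the $d$ independent hash functions, and the 0-based bucket indices (instead of the range $[1, w]$) are assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Illustrative CM-sketch with d arrays of w counters each."""

    def __init__(self, d, w):
        self.d, self.w = d, w
        # Initialization: every counter A_i[j] starts at zero.
        self.rows = [[0] * w for _ in range(d)]

    def _index(self, i, item):
        # Assumed stand-in for h_i(e): BLAKE2b salted with the array number.
        raw = hashlib.blake2b(repr(item).encode(), digest_size=8,
                              salt=i.to_bytes(8, "little")).digest()
        return int.from_bytes(raw, "little") % self.w

    def insert(self, item):
        # Increment the d counters A_i[h_i(e)] by 1.
        for i in range(self.d):
            self.rows[i][self._index(i, item)] += 1

    def delete(self, item):
        # Decrement the same d counters by 1 (item must have been inserted).
        for i in range(self.d):
            self.rows[i][self._index(i, item)] -= 1

    def query(self, item):
        # The smallest of the d counters is the frequency estimate.
        return min(self.rows[i][self._index(i, item)] for i in range(self.d))
```

Because every insertion of $e$ increments all $d$ counters that $e$ maps to, each of those counters is at least the true frequency of $e$, so `query` can only over-estimate.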
The key idea behind our proposed SF-sketch is to maintain two separate sketches, one in SRAM called Slim-subsketch and another in DRAM called Fat-subsketch. Slim-subsketch, as the name suggests, has significantly fewer counters compared to the Fat-subsketch. The motivation behind keeping the small Slim-subsketch is to increase the query speed while consuming very little SRAM memory. The motivation behind keeping the Fat-subsketch is to assist the Slim-subsketch during updates so as to make the accuracy of the Slim-subsketch as high as possible. Fat-subsketch uses many more counters compared to Slim-subsketch because DRAM is cheap and large enough. In designing our SF-sketch, we start with a bare-bones version of the sketch and make improvements step by step to arrive at its final design. In this process, we present five versions of SF-sketch, namely SF$_1$-sketch through SF$_4$-sketch, and SF$_F$-sketch, where each version improves upon the previous versions by addressing some of the limitations in those versions. The version SF$_F$-sketch is the final design of our SF-sketch.
In SF$_1$-sketch, the SRAM Slim-subsketch consists of $d \times w$ counters and the DRAM Fat-subsketch consists of $d \times w'$ counters, where $w' > w$. The Fat-subsketch in SF$_1$-sketch is the same as a standard CM-sketch. The key idea behind the design of SF$_1$-sketch is that, when inserting an item $e$, if the value of a counter to which the hash function $h_i(e)$ points is already greater than the real frequency of the item $e$, then incrementing that counter will only degrade the accuracy. Specifically, when inserting an item $e$, we first insert it into the Fat-subsketch using the insertion operation of the standard CM-sketch and then query its current frequency from this Fat-subsketch using the query operation of the standard CM-sketch. Suppose the Fat-subsketch estimates its current frequency to be $c$. Next, we compute the $d$ hash functions corresponding to the $d$ arrays of the Slim-subsketch, and retrieve the values of the $d$ counters from the Slim-subsketch to which the hash functions point. If the smallest of these $d$ counters is less than the estimate $c$, we increment the smallest counter(s) by 1 in the Slim-subsketch; otherwise, we do nothing. When multiple counters have the smallest value, we increment all of them by 1. The motivation behind incrementing only the smallest counter(s) in the Slim-subsketch is twofold. First, by reducing the number of counters that we increment, the over-estimation error reduces. Second, as fewer counters are incremented, their sizes can be reduced, which reduces the SRAM footprint of the Slim-subsketch.
Unfortunately, the SF$_1$-sketch does not support deletions. To support deletions, the second version, SF$_2$-sketch, maintains an additional DRAM sketch, namely the Deletion-subsketch. The Deletion-subsketch in SF$_2$-sketch is the same as a standard CM-sketch. It contains the same number of arrays and counters per array as the Slim-subsketch and uses the same $d$ hash functions as the Slim-subsketch. Every time SF$_2$-sketch inserts an item in the Fat- and Slim-subsketches as described for the SF$_1$-sketch, it also inserts it into the Deletion-subsketch using the insertion operation of the standard CM-sketch. As both the Deletion- and Slim-subsketches use the same hash functions, each counter of the Slim-subsketch is always less than or equal to the corresponding counter of the Deletion-subsketch. This property enables SF$_2$-sketch to support deletions. More specifically, when deleting an item, we first apply the deletion operation of the standard CM-sketch on the Deletion-subsketch. After the deletion operation, if one or more counters in the Deletion-subsketch become smaller than the corresponding counters in the Slim-subsketch, we decrement those counters in the Slim-subsketch such that no counter in the Slim-subsketch stays greater than the corresponding counter in the Deletion-subsketch. SF$_2$-sketch supports both insertions and deletions at the cost of maintaining two DRAM sketches, i.e., the Fat-subsketch and the Deletion-subsketch. To reduce the DRAM usage, the subsequent versions of the SF-sketch develop novel techniques to combine the functionality of these two DRAM sketches into a single DRAM sketch. Furthermore, we develop techniques to store all required information in that single sketch in such a way that the number of memory accesses required to access the information is minimal. This leads to high insertion and deletion speeds.
D. Technical Challenges
In designing the SF-sketch, we faced several technical challenges. Next, we describe two of the most important ones. The first technical challenge is to achieve a significantly higher accuracy compared to the CM-sketch, which is currently the most widely used sketch. To address this challenge, we leverage our novel insight that if we reduce the number of counters that are incremented for each insertion, the accuracy will improve because the extent of over-estimation error will decrease. When inserting a new item, our proposed sketch does not always increment d counters in the Slim-subsketch, rather increments only the minimum number of counters that need to be incremented to avoid under-estimation error. Note that the query is only processed based on the information stored in the SRAM Slim-subsketch, which is why we focus on minimizing the number of counter increments per insertion only in the Slim-subsketch. To determine exactly which counters to increment in the Slim-subsketch, our SF-sketch makes use of the Fat-subsketch, which enables it to estimate the number of times the item has already been inserted. SF-sketch then either increments only the smallest counters in the Slim-subsketch if the value of the smallest counters is less than this estimated value or increments no counter at all.
The second technical challenge is to enable Slim-subsketch to support deletions. It is very difficult to achieve accurate deletions from the Slim-subsketch because to support deletions, one needs to keep track of exactly which counters were incremented when each item was inserted. This information is required to identify the appropriate counters to decrement when deleting an item and to identify the influence of those decrements on other items. Tracking such information is very expensive, both in terms of memory overhead and computational cost. To address this challenge, instead of achieving accurate deletions, i.e., decrementing all those counters that were incremented at the time of inserting the given item, we achieve approximate deletions, i.e., decrementing as many counters in the Slim-subsketch as possible without causing any under-estimation errors.
E. Key Contributions
1) We propose a new sketch, namely the SF-sketch, which has higher accuracy compared to the prior art while supporting deletions and keeping the query speed unchanged.
2) We implemented C-sketch, CM-sketch, CU-sketch, CML-sketch and SF-sketch on GPU and multi-core CPU platforms. We carried out extensive experiments on these two platforms to evaluate and compare the performance of all these sketches. Experimental results show that SF-sketch outperforms CM-sketch by up to 33.1 times in terms of average relative error.
II. RELATED WORK
The structure of the Count sketch (C-sketch) [18] proposed by Charikar et al. is exactly the same as the CM-sketch [19] described earlier except that each array $A_i$ is associated with two hash functions $h_i(\cdot)$ and $\delta_i(\cdot)$. Each hash function $h_i(\cdot)$ is uniformly distributed with the output in the range $[1, w]$. Each hash function $\delta_i(\cdot)$ evaluates to $-1$ or $+1$ with equal probability. Any pair of hash functions $h_i(\cdot)$ and $h_j(\cdot)$, where $i \ne j$, are pairwise independent. Similarly, hash functions $\delta_i(\cdot)$ and $\delta_j(\cdot)$, where $i \ne j$, are also pairwise independent. To insert an item $e$, for all values of $i \in [1, d]$, C-sketch calculates the hash functions $h_i(e)$ and $\delta_i(e)$ and adds $\delta_i(e)$ to the counter $A_i[h_i(e)]$. When querying the frequency of item $e$, C-sketch reports the median of $\{A_1[h_1(e)] \times \delta_1(e), A_2[h_2(e)] \times \delta_2(e), \ldots, A_d[h_d(e)] \times \delta_d(e)\}$ as an estimate of the frequency of the item $e$.
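The C-sketch update and query rules above can be illustrated with the following Python sketch. The class name `CountSketch`, the salted BLAKE2b construction that emulates $h_i(\cdot)$ and $\delta_i(\cdot)$, and the 0-based bucket indices are assumptions of this example, not details taken from [18].

```python
import hashlib
from statistics import median


def _digest(item, row, kind):
    # kind 0 emulates the bucket hash h_i, kind 1 the sign hash delta_i
    # (an assumed hashing scheme, chosen only for illustration).
    salt = bytes([kind]) + row.to_bytes(4, "little")
    raw = hashlib.blake2b(repr(item).encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(raw, "little")


class CountSketch:
    def __init__(self, d, w):
        self.d, self.w = d, w
        self.rows = [[0] * w for _ in range(d)]

    def _bucket(self, i, item):
        return _digest(item, i, 0) % self.w

    def _sign(self, i, item):
        return 1 if _digest(item, i, 1) % 2 else -1

    def insert(self, item):
        # Add delta_i(e) to the counter A_i[h_i(e)] in every array.
        for i in range(self.d):
            self.rows[i][self._bucket(i, item)] += self._sign(i, item)

    def query(self, item):
        # Median of A_i[h_i(e)] * delta_i(e) over the d arrays.
        return median(self.rows[i][self._bucket(i, item)] * self._sign(i, item)
                      for i in range(self.d))
```

When other items collide with $e$, their signed contributions can push a counter in either direction, which is why the median may land above or below the true frequency.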
Unfortunately, C-sketch suffers from both over-estimation and under-estimation errors. Therefore, several improvements, which do not suffer from the under-estimation errors, have been proposed such as the CM-sketch [19], CU-sketch [20], and Count-Min-Log (CML) sketch [21]. Estan and Varghese proposed the CU-sketch, which can be combined with other sketches. For convenience, we use CU-sketch to mean the CM-CU-sketch, i.e., its combination with the CM-sketch. CU-sketch has $d$ arrays of $w$ counters each [20]. To insert an item $e$ in CU-sketch, the sketch increments only the smallest counter(s) among the $d$ counters that the $d$ hash functions map the item $e$ to. Although CU-sketch improves the query accuracy significantly, its fundamental limitation is that it does not support deletions because to support deletions from the CU-sketch, one needs to keep track of the counters that are incremented at each insertion. The CU-sketch does not perform such tracking. If we apply the deletion operation of the CM-sketch on the CU-sketch, i.e., first compute the $d$ hash functions $h_1(e), h_2(e), \ldots, h_d(e)$ and then decrement the $d$ counters $A_1[h_1(e)], A_2[h_2(e)], \ldots, A_d[h_d(e)]$ by 1, subsequent query results from the resulting CU-sketch will be prone to having under-estimation errors in addition to over-estimation errors. As CU-sketch does not support deletions, it has not received as wide acceptance in practice as the CM-sketch. CML-sketch is another variant of the CM-sketch that uses logarithm-based approximate counters instead of linear counters [21]. Instead of incrementing one counter per array per insertion, it decides whether or not to increment the counters each time with logarithmic probabilities. This helps in reducing the number of bits for each counter, which in turn allows the sketch to have more counters in the same amount of memory and thus achieve better accuracy.
Unfortunately, CML-sketch suffers from both over-estimation and under-estimation errors, and its final version does not support deletions. Thorough statistical analysis of various sketches is provided in [22], [23].
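To make the difference between the CM-sketch and CU-sketch insertion rules concrete, the following Python sketch applies both rules to the same stream. The function names (`bucket`, `cm_insert`, `cu_insert`, `query`) and the salted BLAKE2b hashing are assumptions of this example; only the update rules themselves come from the description above.

```python
import hashlib


def bucket(item, row, w):
    # Assumed stand-in for h_i(e), with 0-based bucket indices.
    raw = hashlib.blake2b(repr(item).encode(), digest_size=8,
                          salt=row.to_bytes(8, "little")).digest()
    return int.from_bytes(raw, "little") % w


def cm_insert(rows, item):
    # CM-sketch: increment every counter the item maps to.
    for i, row in enumerate(rows):
        row[bucket(item, i, len(row))] += 1


def cu_insert(rows, item):
    # CU-sketch: increment only the smallest counter(s) the item maps to.
    idx = [bucket(item, i, len(row)) for i, row in enumerate(rows)]
    smallest = min(rows[i][j] for i, j in enumerate(idx))
    for i, j in enumerate(idx):
        if rows[i][j] == smallest:
            rows[i][j] += 1


def query(rows, item):
    # Both sketches answer a query with the smallest mapped counter.
    return min(row[bucket(item, i, len(row))] for i, row in enumerate(rows))
```

With identical hash functions, every CU counter is at most the corresponding CM counter, so for an insert-only stream the CU estimate lies between the true frequency and the CM estimate.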
A recent work presented the Augmented sketch (A-sketch), which is a universal framework that can be applied to many existing sketches, especially to those with low accuracy [6]. A-sketch uses a filter to catch heavy hitters (high-frequency items) early, and uses classical sketches (such as CM-sketch and C-sketch) to store and query the remaining items. In this way, the accuracy can be greatly improved, and low-frequency items and high-frequency items are rarely misclassified. However, always keeping the most frequent items in the first filter without incurring additional errors is a challenging issue. Therefore, complex design and frequent communications between the two filters are unavoidable, making the implementation complicated. Indeed, A-sketch can be applied to our SF-sketch as well. However, according to our tests, as our SF-sketch is already very accurate, directly combining A-sketch with SF-sketch does not bring a notable increase in accuracy but it does bring more complexity.
Another class of data structures that can be used to store frequencies of items are the enhanced Bloom filters, such as Spectral Bloom Filters (SBF) [24] and Dynamic Count Filters (DCF) [25], which can indeed estimate frequencies of items. SBF replaces each bit in the conventional Bloom filter with a counter [24]. To insert an item, the basic version of SBF simply increments all the counters that the item maps to. When querying the frequency of an item, SBF returns the value of the smallest counter(s) among all the counters to which the hash functions map the item as the estimate of the frequency of that item in the multi-set. DCF extends the concept of SBF while improving the memory efficiency of SBF by using two separate filters [25]. The first filter comprises fixed-size counters while the size of counters in the second filter is dynamically adjusted. The use of two filters, unfortunately, increases the complexity of DCF, which degrades its query and update speeds.
III. THE SLIM-FAT SKETCH
In this section, we present the details of our SF-sketch. To better explain the intuition at work behind the SF-sketch and to justify the design choices we made in developing the SF-sketch, we will start with a basic version and improve it incrementally to arrive at the final design. For each intermediate version of the SF-sketch that we develop while working our way towards the final design, we will first describe its insertion, query, and deletion operations. After that we will discuss its limitations, which will guide us in making our design choices for the next version. In this process, we will present five different versions of SF-sketch, which we name SF 1 -sketch through SF 4 -sketch, and finally SF F -sketch, which is our final design. Each version is developed by studying the limitations of its predecessor version and addressing them.
In our slim-fat architecture (shown in Figure 2), there is a set of arrays with fewer counters per array called a Slim-subsketch, and a set of arrays with comparatively more counters per array called a Fat-subsketch. The Slim-subsketch resides in the fast SRAM memory, while the Fat-subsketch resides in the comparatively slower DRAM memory. When inserting or deleting an item, we first update the Fat-subsketch, and then update the Slim-subsketch based on the observations we make from the Fat-subsketch. The key insight at work behind our proposed scheme is that, when inserting an item, if the value of any counter in the Slim-subsketch to which the incoming item maps is already greater than the number of times that item has already been inserted, then incrementing that counter only degrades the accuracy during the query operation. Therefore, such a counter should not be incremented. The Fat-subsketch enables us to determine whether such a counter in the Slim-subsketch is already greater than the number of times the item has already appeared. Next, we start with the first version of our slim-fat sketch, i.e., the SF$_1$-sketch, and discuss its operations and limitations, which will pave the way towards the design of SF$_2$-sketch and its subsequent versions. Table I summarizes the symbols and abbreviations used in this paper.
A. SF$_1$: Optimizing Accuracy Using One DRAM Subsketch
As shown in Figure 2, SF$_1$-sketch consists of $d$ arrays in both the SRAM Slim-subsketch and the DRAM Fat-subsketch. The Fat-subsketch is exactly a standard CM-sketch with many more counters than the Slim-subsketch. We represent the $i$-th array in the Slim-subsketch with $A_i$ and in the Fat-subsketch with $B_i$. Each array in the Slim-subsketch consists of $w$ buckets while each array in the Fat-subsketch consists of $w'$ buckets, where $w' > w$. Furthermore, each bucket in both the Slim- and Fat-subsketches contains one counter. We represent the counter in the $j$-th bucket of the $i$-th array in the Slim-subsketch with $A_i[j]$, where $1 \le i \le d$ and $1 \le j \le w$. Similarly, we represent the counter in the $k$-th bucket of the $i$-th array in the Fat-subsketch with $B_i[k]$, where $1 \le i \le d$ and $1 \le k \le w'$. Each array $A_i$ is associated with a uniformly distributed independent hash function $h_i(\cdot)$, where the output of $h_i(\cdot)$ lies in the range $[1, w]$. Similarly, each array $B_i$ is associated with a uniformly distributed independent hash function $g_i(\cdot)$, where the output of $g_i(\cdot)$ lies in the range $[1, w']$. The initialization operation for the SF$_1$-sketch consists of simply setting all counters $A_i[j]$ and $B_i[k]$ to zero, where $1 \le i \le d$, $1 \le j \le w$, and $1 \le k \le w'$.
Insertion: When inserting an item, the SF$_1$-sketch first inserts it into the Fat-subsketch and, based on the observations made from the Fat-subsketch, increments the appropriate counters in the Slim-subsketch. The insertion operation in the Fat-subsketch is exactly the same as in the conventional CM-sketch. To insert an item $e$ into the Fat-subsketch, we first compute the $d$ hash functions $g_1(e), g_2(e), \ldots, g_d(e)$ and increment the $d$ counters $B_1[g_1(e)], B_2[g_2(e)], \ldots, B_d[g_d(e)]$ by 1. After inserting the item, we estimate the current frequency of $e$ by finding the minimum value among all counters we just incremented and represent it with $B_e^{\min}$. To insert the item $e$ into the Slim-subsketch, we compute the $d$ hash functions $h_1(e), h_2(e), \ldots, h_d(e)$ and identify the smallest counter(s) among the $d$ counters $A_1[h_1(e)], A_2[h_2(e)], \ldots, A_d[h_d(e)]$. If the smallest counter(s) are not smaller than $B_e^{\min}$, the insertion operation ends. Otherwise, we increment the smallest counter(s) by 1. Note that CU-sketch always increments the smallest counter(s), whereas SF$_1$-sketch skips the increment whenever the smallest counter(s) have already reached $B_e^{\min}$. Thus SF$_1$-sketch is much more accurate than CU-sketch. In other words, $\forall l \in [1, d]$, SF$_1$-sketch increments by one all counters $A_l[h_l(e)]$ that satisfy the following two conditions: $A_l[h_l(e)] = \min_{1 \le i \le d} A_i[h_i(e)]$, and $A_l[h_l(e)] < B_e^{\min}$.
Query: When querying the frequency of item $e$, the SF$_1$-sketch computes the $d$ hash functions $h_1(e), h_2(e), \ldots, h_d(e)$, and returns the value of the smallest counter among $A_1[h_1(e)], A_2[h_2(e)], \ldots, A_d[h_d(e)]$ as the result of the query. Note that the query is entirely answered from the SRAM Slim-subsketch, which makes the query operation very fast.
Deletion: SF 1 -sketch does not support deletions.
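The SF$_1$-sketch insertion and query operations can be sketched in Python as follows. This is an illustrative model only: both subsketches are ordinary Python lists (the SRAM/DRAM placement is not modelled), and the class name `SF1Sketch`, the parameter name `w_fat` (standing for $w'$), the salted BLAKE2b hashing, and the 0-based indices are assumptions of the example.

```python
import hashlib


def _hash(item, row, kind, modulus):
    # kind 0 emulates h_i (Slim), kind 1 emulates g_i (Fat); assumed scheme.
    salt = bytes([kind]) + row.to_bytes(4, "little")
    raw = hashlib.blake2b(repr(item).encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(raw, "little") % modulus


class SF1Sketch:
    def __init__(self, d, w, w_fat):
        if w_fat <= w:
            raise ValueError("the Fat-subsketch needs more buckets than the Slim one")
        self.d, self.w, self.w_fat = d, w, w_fat
        self.slim = [[0] * w for _ in range(d)]      # counters A_i[j]
        self.fat = [[0] * w_fat for _ in range(d)]   # counters B_i[k]

    def insert(self, item):
        # Step 1: standard CM-sketch insertion into the Fat-subsketch.
        fat_idx = [_hash(item, i, 1, self.w_fat) for i in range(self.d)]
        for i, k in enumerate(fat_idx):
            self.fat[i][k] += 1
        b_min = min(self.fat[i][k] for i, k in enumerate(fat_idx))
        # Step 2: increment the smallest Slim counter(s) only if below b_min.
        slim_idx = [_hash(item, i, 0, self.w) for i in range(self.d)]
        smallest = min(self.slim[i][j] for i, j in enumerate(slim_idx))
        if smallest < b_min:
            for i, j in enumerate(slim_idx):
                if self.slim[i][j] == smallest:
                    self.slim[i][j] += 1

    def query(self, item):
        # Answered from the Slim-subsketch alone.
        return min(self.slim[i][_hash(item, i, 0, self.w)] for i in range(self.d))
```

In this model the query result never falls below the true frequency: before an insertion the smallest Slim counter of $e$ is at least the previous frequency of $e$, and it is either already at least $B_e^{\min}$ or is raised by one.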
Advantages and Limitations: The key advantage of the SF$_1$-sketch is that to answer a query it does not access the DRAM Fat-subsketch, but only accesses the SRAM Slim-subsketch, which keeps the query speed of this sketch as fast as the conventional CM-sketch. Furthermore, note that during the insertion operation, we either increment no counters or increment only the smallest counter(s) in the Slim-subsketch. The smallest counter in the Fat-subsketch gives the upper bound on the number of times that a given item has already been inserted. This strategy reduces the number of increments in the Slim-subsketch, which has two advantages. First, it reduces the memory footprint of the Slim-subsketch on the expensive and limited SRAM memory. Second, due to fewer increments, the over-estimation error is reduced. Unfortunately, the biggest limitation of the SF$_1$-sketch is that it does not support deletions from the Slim-subsketch. While the Fat-subsketch assists the Slim-subsketch during the insertion operation, it cannot assist in the deletion operation because the numbers of counters per array in the Fat- and Slim-subsketches are not the same. This inability to support deletions from the Slim-subsketch limits the practical usability of the SF$_1$-sketch. In the next version of our SF-sketch, i.e., the SF$_2$-sketch, we address this limitation while keeping the advantages of the SF$_1$-sketch.
Difficulties for deletions: It is challenging to achieve accurate deletions in SF$_1$-sketch because to delete items from the Slim-subsketch of SF$_1$-sketch, one has to keep track of exactly which counters were incremented when inserting each item. Such tracking is difficult and requires large memory and processing overhead. We explain this with the help of an example. As shown in Figure 3, consider a Slim-subsketch that has two arrays and two counters per array, where all counters are initialized to 0. Suppose we first insert two items $e_1$ and $e_2$ and then delete the item $e_1$.
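Furthermore, let $e_1$ map to $A_1[1]$ and $A_2[1]$, and let $e_2$ map to $A_1[1]$ and $A_2[2]$. When inserting $e_1$, both $A_1[1]$ and $A_2[1]$ are 0, so we increment both of them to 1. When inserting $e_2$, as $A_1[1]$ is 1 and $A_2[2]$ is 0, we only increment the smaller of the two, i.e., $A_2[2]$ to 1. At this point, $A_1[1] = A_2[1] = A_2[2] = 1$. In deleting $e_1$, as $e_1$ maps to both $A_1[1]$ and $A_2[1]$ and as both were incremented at the time of inserting $e_1$, if we decrement them both, $A_1[1]$ becomes 0 and the query result of $e_2$ will be 0, i.e., an under-estimation error occurs, which we do not want in our SF-sketch.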
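The example can be replayed with a few lines of Python. The mappings used here ($e_1$ to $A_1[1]$ and $A_2[1]$, $e_2$ to $A_1[1]$ and $A_2[2]$) are an assumption chosen to be consistent with the example, indices are 0-based, and the Fat-subsketch is assumed to report the exact frequency so that the smallest counter(s) are always incremented.

```python
# Slim-subsketch with two arrays of two counters each, all initialized to 0.
slim = [[0, 0], [0, 0]]
# Assumed mappings (0-based): e1 -> A_1[1], A_2[1]; e2 -> A_1[1], A_2[2].
maps = {"e1": [0, 0], "e2": [0, 1]}


def insert_smallest(item):
    # Increment only the smallest counter(s) that the item maps to.
    cells = list(enumerate(maps[item]))
    smallest = min(slim[i][j] for i, j in cells)
    for i, j in cells:
        if slim[i][j] == smallest:
            slim[i][j] += 1


def query(item):
    return min(slim[i][j] for i, j in enumerate(maps[item]))


insert_smallest("e1")   # both counters of e1 go from 0 to 1
insert_smallest("e2")   # only A_2[2] is incremented
state_after_inserts = [row[:] for row in slim]

# Naive deletion of e1: decrement every counter that e1 maps to.
for i, j in enumerate(maps["e1"]):
    slim[i][j] -= 1

underestimate = query("e2")   # e2 was inserted once and never deleted
```

After the naive deletion, the shared counter $A_1[1]$ is back at 0 even though $e_2$ is still present, so the estimate for $e_2$ drops below its true frequency of 1.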
Deletion-subsketch: To support deletions, in addition to one SRAM Slim-subsketch and one DRAM Fat-subsketch just like in the SF$_1$-sketch, the SF$_2$-sketch maintains another sketch in DRAM, called the Deletion-subsketch. The Deletion-subsketch is essentially a standard CM-sketch. Unlike the Fat-subsketch, all the parameters ($d$, $w$, $h_i(\cdot)$) of the Deletion-subsketch and the Slim-subsketch are exactly the same. For the Deletion-subsketch, we represent the counter in the $j$-th bucket of the $i$-th array with $C_i[j]$, where $1 \le i \le d$ and $1 \le j \le w$. Note that the Fat-subsketch helps in deciding which counters to increment in the Slim-subsketch while inserting an item, whereas the Deletion-subsketch helps in deciding which counters to decrement in the Slim-subsketch when deleting an item. The initialization operation for the SF$_2$-sketch consists of simply setting all counters $A_i[j]$, $B_i[k]$, and $C_i[j]$ to 0, where $1 \le i \le d$, $1 \le j \le w$, and $1 \le k \le w'$.
Insertion: The insertion operation of the SF$_2$-sketch for the Slim- and Fat-subsketches is exactly the same as that of the SF$_1$-sketch, except that for the SF$_2$-sketch, we also add information about the incoming item to the Deletion-subsketch. Specifically, to insert an item $e$ into the Deletion-subsketch, we compute the $d$ hash functions $h_1(e), h_2(e), \ldots, h_d(e)$ and increment the $d$ counters $C_1[h_1(e)], C_2[h_2(e)], \ldots, C_d[h_d(e)]$ by 1.
Query: The query operation of the SF 2 -sketch is exactly the same as the SF 1 -sketch.
Deletion: To delete an item $e$ from the SF$_2$-sketch, we first delete it from the Fat-subsketch by decrementing the $d$ counters $B_1[g_1(e)], B_2[g_2(e)], \ldots, B_d[g_d(e)]$ by 1 and then delete it from the Deletion-subsketch by decrementing the $d$ counters $C_1[h_1(e)], C_2[h_2(e)], \ldots, C_d[h_d(e)]$ by 1. Finally, we delete it from the Slim-subsketch. We leverage the fact that before deleting the item from the Deletion-subsketch, each counter in the Slim-subsketch is always less than or equal to the corresponding counter in the Deletion-subsketch, because when inserting an item, even if a counter in the Slim-subsketch to which the incoming item maps is not incremented, the corresponding counter in the Deletion-subsketch is always incremented. To delete the item $e$ from the Slim-subsketch, for each $i \in [1, d]$, if $A_i[h_i(e)]$ is now greater than $C_i[h_i(e)]$, we decrement $A_i[h_i(e)]$ so that it equals $C_i[h_i(e)]$; otherwise, we leave it unchanged.
Advantages and Limitations: The SF$_2$-sketch is advantageous over the SF$_1$-sketch because it supports deletions. However, it is not efficient in terms of memory usage and update speed because it has to maintain an additional sketch, the Deletion-subsketch, to support deletions from the Slim-subsketch. In the next version of the SF-sketch, i.e., the SF$_3$-sketch, we address this limitation while keeping the advantages of both the SF$_1$- and SF$_2$-sketches.
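The SF$_2$-sketch operations, including the approximate deletion from the Slim-subsketch, can be sketched as follows. As before, this is a hedged illustration: the class name `SF2Sketch`, the attribute name `dele` for the Deletion-subsketch, the salted BLAKE2b hashing, and the 0-based indices are assumptions, and the clamping rule in `delete` follows the description that no Slim counter may stay greater than the corresponding Deletion counter.

```python
import hashlib


def _hash(item, row, kind, modulus):
    # kind 0 emulates h_i (Slim and Deletion), kind 1 emulates g_i (Fat).
    salt = bytes([kind]) + row.to_bytes(4, "little")
    raw = hashlib.blake2b(repr(item).encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(raw, "little") % modulus


class SF2Sketch:
    def __init__(self, d, w, w_fat):
        self.d, self.w, self.w_fat = d, w, w_fat
        self.slim = [[0] * w for _ in range(d)]      # A_i[j]
        self.fat = [[0] * w_fat for _ in range(d)]   # B_i[k]
        self.dele = [[0] * w for _ in range(d)]      # C_i[j], same h_i as Slim

    def insert(self, item):
        fat_idx = [_hash(item, i, 1, self.w_fat) for i in range(self.d)]
        for i, k in enumerate(fat_idx):
            self.fat[i][k] += 1
        b_min = min(self.fat[i][k] for i, k in enumerate(fat_idx))
        idx = [_hash(item, i, 0, self.w) for i in range(self.d)]
        for i, j in enumerate(idx):
            self.dele[i][j] += 1                     # always incremented
        smallest = min(self.slim[i][j] for i, j in enumerate(idx))
        if smallest < b_min:
            for i, j in enumerate(idx):
                if self.slim[i][j] == smallest:
                    self.slim[i][j] += 1

    def delete(self, item):
        # The item is assumed to have been inserted before.
        for i in range(self.d):
            self.fat[i][_hash(item, i, 1, self.w_fat)] -= 1
            j = _hash(item, i, 0, self.w)
            self.dele[i][j] -= 1
            # Lower A_i[j] only as far as needed to keep A_i[j] <= C_i[j].
            if self.slim[i][j] > self.dele[i][j]:
                self.slim[i][j] = self.dele[i][j]

    def query(self, item):
        return min(self.slim[i][_hash(item, i, 0, self.w)] for i in range(self.d))
```

Because each Deletion counter is itself a CM-sketch counter, it never drops below the true frequency of any item mapped to it, so clamping the Slim counters to it cannot introduce under-estimation.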
C. SF 3 : Supporting Deletion Using One DRAM Subsketch
In SF$_3$-sketch, we get rid of the separate Deletion-subsketch, and modify the Fat-subsketch so that, in addition to insertions, it can assist deletions in the Slim-subsketch. The Fat-subsketch in the SF$_3$-sketch is similar to the Fat-subsketch in the SF$_1$- and SF$_2$-sketches. However, in the Fat-subsketch of SF$_3$-sketch, the number of buckets in each array is given by $w' = z \times w$, where $z$ is a positive integer. In other words, the DRAM Fat-subsketch consumes $z$ times as much memory as the SRAM Slim-subsketch. The structure of the Slim-subsketch in the SF$_3$-sketch is exactly the same as the Slim-subsketches in the SF$_1$- and SF$_2$-sketches. However, the hash functions $h_i(\cdot)$, where $1 \le i \le d$, associated with the Slim-subsketch are now derived from the hash functions $g_i(\cdot)$, where the output of $g_i(\cdot)$ lies in the range $[1, z \times w]$. More specifically, each $h_i(\cdot)$ is computed from $g_i(\cdot)$ by Equation (1). Consequently, the value of the hash function $h_i(\cdot)$ always lies in the range $[1, w]$, where $w$ is the number of buckets per array in the Slim-subsketch. Note also that calculating the hash function $h_i(\cdot)$ from the hash function $g_i(\cdot)$ using Equation (1) essentially associates each counter $A_i[j]$ in the Slim-subsketch with $z$ counters in the Fat-subsketch. Every time a counter in the Slim-subsketch is incremented, it is certain that one of its associated $z$ counters in the Fat-subsketch is also incremented. This further means that the value of a counter in the Slim-subsketch will always be less than or equal to the sum of values of all its associated counters in the Fat-subsketch.
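Insertion: When inserting an item $e$, the SF$_3$-sketch first inserts it into the Fat-subsketch. For this we compute the $d$ hash functions $g_1(e), g_2(e), \ldots, g_d(e)$ and increment the $d$ counters $B_1[g_1(e)], B_2[g_2(e)], \ldots, B_d[g_d(e)]$ by 1. As in the SF$_1$-sketch, we represent the minimum of these $d$ counters with $B_e^{\min}$ and then identify the smallest counter(s) among $A_1[h_1(e)], A_2[h_2(e)], \ldots, A_d[h_d(e)]$ in the Slim-subsketch. If the smallest counter(s) are smaller than $B_e^{\min}$, we increment them by 1; if they are not smaller than $B_e^{\min}$, we do nothing.
Query: The query operation of the SF$_3$-sketch is exactly the same as that of the SF$_1$- and SF$_2$-sketches.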
Deletion: To delete an item from the SF$_3$-sketch, we first delete it from the Fat-subsketch and then from the Slim-subsketch. To delete the item $e$ from the Fat-subsketch, we first calculate the $d$ hash functions $g_1(e), g_2(e), \ldots, g_d(e)$ and then decrement the $d$ counters $B_1[g_1(e)], B_2[g_2(e)], \ldots, B_d[g_d(e)]$ by 1. To delete the item $e$ from the Slim-subsketch, we leverage the fact stated earlier that before deleting the item from the Fat-subsketch, the value of a counter in the Slim-subsketch is always less than or equal to the sum of values of all its associated counters in the Fat-subsketch, because when inserting an item, even if a counter in the Slim-subsketch is not incremented, one of the associated counters in the Fat-subsketch is always incremented. To delete the item $e$ from the Slim-subsketch, after deleting it from the Fat-subsketch, for each $i \in [1, d]$, we compute the sum of the $z$ counters in $B_i$ that are associated with $A_i[h_i(e)]$; if $A_i[h_i(e)]$ is greater than this sum, we decrement $A_i[h_i(e)]$ so that it equals the sum. Note that each value of $h_i(e)$ is calculated using Equation (1).
Advantages and Limitations:
The advantage of SF 3 -sketch over the SF 2 -sketch is that it does not have to maintain a separate Deletion-subsketch. Unfortunately, it is not efficient in terms of deletion speed because to delete an item, it needs d × z memory accesses to add the counters in each array of the Fat-subsketch. In the next version of our SF-sketch, i.e., the SF 4 -sketch, we address this limitation while keeping the advantages of all three previous versions of the SF-sketch.
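The following Python sketch illustrates the SF$_3$-sketch, in particular the $d \times z$ counter reads incurred by a deletion. The mapping from $g_i(\cdot)$ to $h_i(\cdot)$ used here, namely integer division of the 0-based Fat index by $z$ so that each Slim counter is associated with $z$ consecutive Fat counters, is an assumption of this example and stands in for Equation (1); the class name `SF3Sketch` and the salted BLAKE2b hashing are likewise assumptions.

```python
import hashlib


class SF3Sketch:
    def __init__(self, d, w, z):
        self.d, self.w, self.z = d, w, z
        self.slim = [[0] * w for _ in range(d)]        # A_i[j]
        self.fat = [[0] * (z * w) for _ in range(d)]   # B_i[k], w' = z * w

    def _g(self, i, item):
        # Assumed stand-in for g_i(e), 0-based, in the range [0, z * w).
        raw = hashlib.blake2b(repr(item).encode(), digest_size=8,
                              salt=i.to_bytes(8, "little")).digest()
        return int.from_bytes(raw, "little") % (self.z * self.w)

    def insert(self, item):
        ks = [self._g(i, item) for i in range(self.d)]
        for i, k in enumerate(ks):
            self.fat[i][k] += 1
        b_min = min(self.fat[i][k] for i, k in enumerate(ks))
        # Assumed derivation of h_i from g_i: h_i(e) = g_i(e) // z.
        smallest = min(self.slim[i][k // self.z] for i, k in enumerate(ks))
        if smallest < b_min:
            for i, k in enumerate(ks):
                if self.slim[i][k // self.z] == smallest:
                    self.slim[i][k // self.z] += 1

    def delete(self, item):
        # The item is assumed to have been inserted before.
        for i in range(self.d):
            k = self._g(i, item)
            self.fat[i][k] -= 1
            j = k // self.z
            # z counter reads per array, hence d * z reads per deletion.
            group_sum = sum(self.fat[i][j * self.z:(j + 1) * self.z])
            if self.slim[i][j] > group_sum:
                self.slim[i][j] = group_sum

    def query(self, item):
        return min(self.slim[i][self._g(i, item) // self.z]
                   for i in range(self.d))
```

The `group_sum` computation is the step whose cost motivates the next version: it touches all $z$ Fat counters associated with each of the $d$ Slim counters of the deleted item.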
D. SF 4 : Improving Deletion Speed
In the SF$_4$-sketch, we modify the Fat-subsketch so that each bucket holds $z$ counters instead of one. As shown in Figure 4, the Fat-subsketch of the SF$_4$-sketch has $d$ arrays with $w$ buckets each, and each bucket contains $z$ counters. We represent the $k$-th counter in the $j$-th bucket of the $i$-th array in the Fat-subsketch with $B_i[j][k]$, where $1 \le i \le d$, $1 \le j \le w$, and $1 \le k \le z$. Each array $B_i$ in the Fat-subsketch is associated with two uniformly distributed independent hash functions: $h_i(.)$ with output in the range $[1, w]$, which maps an item to a bucket in the $i$-th array, and $f_i(.)$ with output in the range $[1, z]$, which maps an item to a counter inside the bucket $B_i[h_i(.)]$ of the $i$-th array. The Slim-subsketch uses the same hash functions $h_i(.)$ as the Fat-subsketch to map items to buckets. Every time a counter in the Slim-subsketch is incremented, it is certain that one of the $z$ counters in the corresponding bucket of the Fat-subsketch is also incremented. This means that the value of a counter in the Slim-subsketch will always be less than or equal to the sum of the values of all counters in the corresponding bucket in the Fat-subsketch.
Insertion: When inserting an item, the SF$_4$-sketch first inserts it into the Fat-subsketch and, based on the observations it makes from the Fat-subsketch, increments the appropriate counters in the Slim-subsketch. Specifically, to insert an item $e$ into the Fat-subsketch, we first compute the $d$ hash functions $h_1(e), h_2(e), \ldots, h_d(e)$ and another $d$ hash functions $f_1(e), f_2(e), \ldots, f_d(e)$, and increment the $d$ counters $B_1[h_1(e)][f_1(e)], B_2[h_2(e)][f_2(e)], \ldots, B_d[h_d(e)][f_d(e)]$ by 1. Next, we find the minimum value among all counters we just incremented and represent it with $B^{\min}_e$. To insert the item $e$ into the Slim-subsketch, we identify the counters with the smallest value among the $d$ counters $A_1[h_1(e)], A_2[h_2(e)], \ldots, A_d[h_d(e)]$ and increment them by 1 only if their values are less than $B^{\min}_e$. In other words, we increment by one all counters $A_l[h_l(e)]$ that satisfy the condition $A_l[h_l(e)] = \min_{1 \le i \le d} A_i[h_i(e)] < B^{\min}_e$. If $\min_{1 \le i \le d} A_i[h_i(e)] \ge B^{\min}_e$, we do nothing.
Query: The query operation of the SF$_4$-sketch is exactly the same as that of the SF$_1$-, SF$_2$-, and SF$_3$-sketches.
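The insertion and query steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `SF4Sketch` and the helper `_hash` are hypothetical, BLAKE2b-based hashing stands in for the hash functions $h_i$ and $f_i$, indices are 0-based, and the query is assumed to return the minimum of the $d$ Slim-subsketch counters as in a CM-sketch.

```python
import hashlib


def _hash(item: str, seed: str, modulus: int) -> int:
    """Illustrative stand-in for h_i / f_i: maps item to [0, modulus)."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % modulus


class SF4Sketch:
    def __init__(self, d: int, w: int, z: int):
        self.d, self.w, self.z = d, w, z
        # Slim-subsketch: d arrays of w counters, A_i[j].
        self.slim = [[0] * w for _ in range(d)]
        # Fat-subsketch: d arrays of w buckets with z counters each, B_i[j][k].
        self.fat = [[[0] * z for _ in range(w)] for _ in range(d)]

    def _positions(self, item: str):
        # (h_i(e), f_i(e)) for every array i.
        return [(_hash(item, f"h{i}", self.w), _hash(item, f"f{i}", self.z))
                for i in range(self.d)]

    def insert(self, item: str) -> None:
        pos = self._positions(item)
        # Step 1: increment B_i[h_i(e)][f_i(e)] in every array of the Fat-subsketch.
        for i, (j, k) in enumerate(pos):
            self.fat[i][j][k] += 1
        # Step 2: B_min is the smallest of the counters just incremented.
        b_min = min(self.fat[i][j][k] for i, (j, k) in enumerate(pos))
        # Step 3: increment the smallest Slim counters only if they are below B_min.
        a_min = min(self.slim[i][j] for i, (j, _) in enumerate(pos))
        if a_min < b_min:
            for i, (j, _) in enumerate(pos):
                if self.slim[i][j] == a_min:
                    self.slim[i][j] += 1

    def query(self, item: str) -> int:
        # Assumed CM-style query: minimum of the d Slim-subsketch counters.
        return min(self.slim[i][j]
                   for i, (j, _) in enumerate(self._positions(item)))
```

Because a Slim counter is only incremented while it is below $B^{\min}_e$, and every Fat counter touched by $e$ is at least the true frequency of $e$, the estimate returned by `query` never falls below the true frequency in this sketch.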
Deletion: To delete an item from the SF$_4$-sketch, we first delete it from the Fat-subsketch and then from the Slim-subsketch. To delete the item $e$ from the Fat-subsketch, we first calculate the $d$ hash functions $h_1(e), h_2(e), \ldots, h_d(e)$ and another $d$ hash functions $f_1(e), f_2(e), \ldots, f_d(e)$, and decrement the $d$ counters $B_1[h_1(e)][f_1(e)], B_2[h_2(e)][f_2(e)], \ldots, B_d[h_d(e)][f_d(e)]$ by 1. To delete the item $e$ from the Slim-subsketch, we leverage the fact stated earlier that, before deleting the item from the Fat-subsketch, the value of a counter in the Slim-subsketch is always less than or equal to the sum of the values of all counters in the corresponding bucket in the Fat-subsketch. To delete the item $e$ from the Slim-subsketch, after deleting it from the Fat-subsketch, for each $i \in$
. Therefore, one deletion from the Fat-subsketch only needs d × z × b/W memory accesses, where b is the number of bits of each counter, W is the size of the machine word, and b < W .
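As a rough worked example of the access count $d \times z \times b / W$, the following Python sketch computes the number of word-sized reads needed to sum the counters of the $d$ buckets touched by one deletion. The function name is hypothetical, rounding up to whole words per bucket is an assumption, and the counter width $b = 16$ bits used in the example is an illustrative value only.

```python
import math


def fat_accesses_per_deletion(d: int, z: int, b: int, W: int) -> int:
    """Word reads needed to sum the z counters in each of the d buckets.

    d: number of arrays, z: counters per bucket,
    b: bits per counter, W: machine word size in bits.
    Assumption: a bucket is stored contiguously and read in whole words.
    """
    words_per_bucket = math.ceil(z * b / W)
    return d * words_per_bucket


# Illustrative values: d = 5, z = 3, 16-bit counters, 64-bit words.
# The 3 counters of a bucket occupy 48 bits and fit in one word.
accesses = fat_accesses_per_deletion(d=5, z=3, b=16, W=64)
```

With these illustrative values each bucket is read with one access, so a deletion touches 5 words, compared with $d \times z = 15$ accesses when the associated counters are scattered across an array.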
Advantages and Limitations:
The principles behind the SF$_4$-sketch and the SF$_3$-sketch are essentially the same. The advantage of the SF$_4$-sketch over the SF$_3$-sketch is that all counters in the Fat-subsketch corresponding to a counter in the Slim-subsketch are now located in the same bucket. Thus, adding the values of the $z$ counters usually takes only a single memory access. Building on the SF$_4$-sketch, our final version, the SF$_F$-sketch, aims to minimize the over-estimation error.
E. SF$_F$: Reducing Over-Estimation Error (The Final Version)
The structure of the SF$_F$-sketch is exactly the same as that of the SF$_4$-sketch. The key idea behind the SF$_F$-sketch is that, when updating the counters in the Slim-subsketch, we keep the value of each counter in the Slim-subsketch less than or equal to the value of the largest counter in the corresponding bucket of the Fat-subsketch during insertion and deletion operations. Next, we describe how the insertion, deletion, and query operations work in the SF$_F$-sketch, followed by an analysis of its error and accuracy.
Insertion: The insertion operation of the SF$_F$-sketch is exactly the same as the insertion operation of the SF$_4$-sketch.
Query: The query operation of the SF$_F$-sketch is exactly the same as that of the previous versions of the SF-sketch.
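Deletion: To delete an item from the SF$_F$-sketch, we first delete it from the Fat-subsketch and then from the Slim-subsketch. To delete an item $e$ from the Slim-subsketch, we first check the $d$ buckets $B_1[h_1(e)], B_2[h_2(e)], \ldots, B_d[h_d(e)]$. For each bucket, if the largest counter $\max_{1 \le k \le z} B_i[h_i(e)][k]$ changes when deleting item $e$ from the Fat-subsketch, we set
. Otherwise, we leave the value of $A_i[h_i(e)]$ unchanged. The key difference between the deletion operations of the SF$_F$-sketch and the SF$_4$-sketch is that in the SF$_F$-sketch, we compare the value of $A_i[h_i(e)]$ with the largest counter in the corresponding bucket, $\max_{1 \le k \le z} B_i[h_i(e)][k]$, rather than with the sum $\sum_{k=1}^{z} B_i[h_i(e)][k]$, which significantly reduces the values of the counters in the Slim-subsketch.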
Advantages:
The key advantage of the SF$_F$-sketch over the SF$_4$-sketch is that, during deletion, it significantly reduces the counter values in the Slim-subsketch, because the SF$_F$-sketch compares the values of counters in the Slim-subsketch with the values of the largest counters in the corresponding buckets of the Fat-subsketch instead of with the sum of the values of all counters in those buckets. This significantly reduces the over-estimation error of the SF$_F$-sketch. Note that the SF$_F$-sketch does not suffer from under-estimation error.
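The deletion idea above can be illustrated with a short Python sketch. It is a sketch under assumptions rather than the authors' exact rule: the function name `sff_delete` and its argument layout are hypothetical, and the update applied when the largest counter of a bucket changes is assumed to be a clamp, `min(A, max of bucket)`, chosen only because it preserves the stated invariant that a Slim-subsketch counter never exceeds the largest counter in its bucket.

```python
def sff_delete(slim, fat, positions):
    """Delete one occurrence of an item from both subsketches.

    slim[i][j]   : Slim-subsketch counter A_i[j]
    fat[i][j]    : list of the z counters B_i[j][0..z-1] of one bucket
    positions[i] : pair (j, k) = (h_i(e), f_i(e)) for array i (0-based)
    """
    for i, (j, k) in enumerate(positions):
        old_max = max(fat[i][j])
        fat[i][j][k] -= 1  # deletion from the Fat-subsketch
        new_max = max(fat[i][j])
        if new_max != old_max:
            # Largest counter of the bucket changed.
            # Assumed update: clamp A_i[j] to the new largest counter.
            slim[i][j] = min(slim[i][j], new_max)
        # Otherwise A_i[j] is left unchanged.


# One array, one bucket with z = 2 counters.
slim_a, fat_a = [[3]], [[[3, 1]]]
sff_delete(slim_a, fat_a, [(0, 0)])  # bucket max drops from 3 to 2

slim_b, fat_b = [[3]], [[[3, 3]]]
sff_delete(slim_b, fat_b, [(0, 0)])  # bucket max stays at 3
```

In the first case the Slim counter is lowered to 2; in the second it stays at 3 because another counter in the bucket still holds the maximum.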
Bound on Over-estimation Error: As a query is entirely answered from the Slim-subsketch, the over-estimation error of the SF$_F$-sketch is the over-estimation error of the Slim-subsketch, which we calculate next. Let $\alpha$ represent the average number of counters in any given array of the Slim-subsketch that are incremented per insertion. Note that for the standard CM-sketch, the value of $\alpha$ is equal to 1 because exactly one counter is incremented in each array when inserting an item. For the Slim-subsketch in the SF$_F$-sketch, $\alpha$ is less than or equal to 1 because the Fat-subsketch helps reduce the number of counters that are incremented in the Slim-subsketch per insertion. For any given item $e$, let $f(e)$ represent its actual frequency and let $\hat{f}(e)$ represent the estimate of its frequency returned by the Slim-subsketch of the SF$_F$-sketch. Let $N$ represent the total number of insertions of all items into the SF$_F$-sketch. Let $h_i(.)$ represent the hash function associated with the $i$-th array of the Slim-subsketch, where $1 \le i \le d$. Let $X_{i,(e)}[j]$ be the random variable that represents the difference between the value of the $j$-th counter in the $i$-th array and the actual frequency $f(e)$ of the item $e$, i.e., $X_{i,(e)}[j] = A_i[j] - f(e)$, where $j = h_i(e)$. Due to hash collisions, multiple items will be mapped by the hash function $h_i(.)$ to the counter $j$, which increases the value of $A_i[j]$ beyond $f(e)$ and results in over-estimation error. As all hash functions have uniformly distributed output, $\Pr[h_i(e_1) = h_i(e_2)] = 1/w$. Therefore, the expected value of any counter $A_i[j]$, where $1 \le i \le d$ and $1 \le j \le w$, is $\alpha N / w$. Let $\epsilon$ and $\delta$ be two numbers that are related to $d$ and $w$ as follows: $d = \ln(1/\delta)$ and $w = \mathrm{e}/\epsilon$, where $\mathrm{e}$ denotes Euler's number. The expected value of $X_{i,(e)}[j]$ is bounded by the following expression.

$$E\big(X_{i,(e)}[j]\big) \le \frac{\alpha N}{w} = \frac{\epsilon}{\mathrm{e}}\,\alpha N \qquad (2)$$

Finally, we derive the probabilistic bound on the over-estimation error of the Slim-subsketch of the SF$_F$-sketch.

$$\Pr\big[\hat{f}(e) \ge f(e) + \epsilon \alpha N\big] = \Pr\big[\forall i:\ A_i[h_i(e)] \ge f(e) + \epsilon \alpha N\big] = \big(\Pr\big[X_{i,(e)}[j] \ge \epsilon \alpha N\big]\big)^d$$

Substituting the value of $\alpha N$ from Equation (2) into the right side of the equation above and applying Markov's inequality, we get

$$\Pr\big[\hat{f}(e) \ge f(e) + \epsilon \alpha N\big] \le \big(\Pr\big[X_{i,(e)}[j] \ge \mathrm{e} \cdot E(X_{i,(e)}[j])\big]\big)^d \le \mathrm{e}^{-d} = \delta$$
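To make the relation between the error parameters and the sketch dimensions concrete, the following Python sketch computes $d$ and $w$ from $\epsilon$ and $\delta$ using $d = \ln(1/\delta)$ and $w = \mathrm{e}/\epsilon$ as in the standard CM-sketch analysis. The function name is hypothetical, and rounding both values up to integers is an assumption made so that they can be used as array dimensions.

```python
import math


def slim_dimensions(epsilon: float, delta: float):
    """Return (d, w) for a target error factor epsilon and failure probability delta."""
    d = math.ceil(math.log(1.0 / delta))  # number of arrays, d = ln(1/delta)
    w = math.ceil(math.e / epsilon)       # counters per array, w = e/epsilon
    return d, w


# Example: error at most 0.001 * alpha * N with probability at least 99%.
d, w = slim_dimensions(epsilon=0.001, delta=0.01)
```

For these example values, $\ln(100) \approx 4.6$ gives $d = 5$ arrays and $\mathrm{e}/0.001 \approx 2718.3$ gives $w = 2719$ counters per array.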
Derivation of Correct Rate: The Correct Rate of a sketch is defined as the expected percentage of items in the given multi-set for which the query response of the sketch contains no error. In deriving the correct rate of SF F -sketch, we make two assumptions: 1) all hash functions are independent; 2) the Fat-subsketch is large enough to have negligible error. Before deriving the correct rate, we first prove the following theorem.
Theorem 1: In the Slim-subsketch, the value of any given counter is equal to the frequency of the most frequent item that maps to it.
Proof: We prove this theorem using mathematical induction on number of insertions, represented by k.
Base Case, k = 0: The theorem clearly holds for the base case because with no insertions, the frequency of the most frequent item is currently 0, which is also the value of all counters.
Induction Hypothesis, k = n: Suppose the statement of the theorem holds true after n insertions.
Induction Step, $k = n+1$: Let the $(n+1)$-st insertion be of any item $e$ that has previously been inserted $a$ times. Let $\alpha_i(k)$ represent the value of the counter $A_i[h_i(e)]$ after $k$ insertions, where $0 \le i \le d-1$. There are two cases to consider: 1) $e$ was the most frequent item when $k = n$; 2) $e$ was not the most frequent item when $k = n$.
Case 1:
If $e$ was the most frequent item when $k = n$, then according to our induction hypothesis, $\alpha_i(n) = a$. After inserting $e$, it will still be the most frequent item and its frequency increases to $a + 1$. The counter $A_i[h_i(e)]$ will be incremented once. Consequently, we get $\alpha_i(n+1) = a + 1$. Thus for this case, the theorem statement holds because the value of the counter $A_i[h_i(e)]$ after insertion is still equal to the frequency of the most frequent item, which is $e$.
Case 2:
If $e$ was not the most frequent item when $k = n$, then according to our induction hypothesis, $\alpha_i(n) > a$. After inserting $e$, it may or may not become the most frequent item. If it becomes the most frequent item, it means that $\alpha_i(n) = a + 1$ and, under our SF$_F$-sketch scheme, the counter $A_i[h_i(e)]$ will stay unchanged. Consequently, we get $\alpha_i(n+1) = \alpha_i(n) = a + 1$. Thus for this case, the theorem statement again holds because the value of the counter $A_i[h_i(e)]$ after insertion is equal to the frequency of the new most frequent item, which is $e$.
After inserting $e$, if it does not become the most frequent item, then it means that $\alpha_i(n) > a + 1$ and, under our SF$_F$-sketch scheme, the counter $A_i[h_i(e)]$ will stay unchanged. Consequently, $\alpha_i(n+1) = \alpha_i(n) > a + 1$. Thus, the theorem again holds because the value of the counter $A_i[h_i(e)]$ after insertion is still equal to the frequency of the item that was the most frequent after $n$ insertions.
Next, we derive the correct rate of the SF$_F$-sketch. Let $v$ be the number of distinct items inserted into the Slim-subsketch, represented by $e_1, e_2, \ldots, e_v$. Without loss of generality, let the item $e_{l+1}$ be more frequent than $e_l$, where $1 \le l \le v - 1$. Let $X$ be the random variable representing the number of items hashing into the counter $A_i[h_i(e_l)]$ given the item $e_l$, where 0
From Theorem 1, we conclude that if e l has the highest frequency among all items that map to the given counter A i [h i (e l )], then the query result for e l will contain no error. Let E be the event that e l has the maximum frequency among x items that map to A i [h i (e l )]. The probability P {E} is given by the following equation:
Let P represent the probability that the query result for e l from any given counter contains no error. It is given by:
As there are d counters, the overall probability that the query result of e l is correct is given by the following equation.
The equality above holds when all $v$ items have different frequencies. If two or more items have equal frequencies, the correct rate increases slightly. Consequently, the expected correct rate $Cr$ of the Slim-subsketch is bounded by:
IV. IMPLEMENTATION
In this section, we describe our implementation of the sketches on two different computing platforms namely CPU and GPU. We extensively tested and evaluated SF-sketch and compared its performance with prior sketches on these two platforms. Next, we first describe our implementation on the CPU platform and then describe our implementation on the GPU platform.
A. CPU Implementation
Our CPU platform comprised a machine with dual 6-core CPUs (24 threads, Intel Xeon CPU E5-2620 @2 GHz) and 62 GB total system DRAM. Each CPU has three levels of cache memory: L1, L2, and L3. The L1 cache consists of two 32KB caches, where one acts as the data cache and the other as the instruction cache. The L2 cache is a single 256KB cache and the L3 cache is a single 15MB cache. To evaluate the schemes in different types of settings, our implementations on the CPU platform include both a single-thread implementation and a multi-thread implementation. We used C++ as the programming language. In the single-thread implementation, for each sketch, we implemented the entire insertion, deletion, and query process within a single thread. In the multi-thread implementation, we run each query in a dedicated thread and process it completely inside that thread, observing near-linear growth in query speed as the number of threads increases. We present the results on query speed in more detail in Section V-C1.
B. GPU Implementation
As GPUs have seen wide acceptance for high-speed data processing, we implemented our sketches on GPUs as well. For these implementations, we employ the basic architecture of GAMT [26]. More specifically, we evaluated the sketches on the GPU platform using CUDA 5.0. We performed our experiments on a DELL T620 server with an Intel CPU (Xeon E5-2630, 2.30 GHz, 6 cores) and an NVIDIA GPU (Tesla C2075, 1147 MHz, 5376 MB device memory, 448 CUDA cores). We implemented our sketches on the GPU using two prevalent techniques: batch processing and multi-stream pipelining. Next, we describe our implementations for these two techniques.
1) Batch Processing: Our system architecture is based on CUDA [29] , the well-known parallel computing platform created by NVIDIA. In our implementation, a typical query cycle proceeds in following three steps: (1) copy the incoming queries from the CPU to the GPU, (2) execute the query kernel, and (3) copy the result from the GPU back to the CPU. A kernel in CUDA is a function that is called on CPU but executed on GPU. A query kernel is configured with a series of thread blocks, where each block is comprised of a group of working threads. As GPU chips have hundreds and even thousands of cores, batch processing is needed to accelerate GPU-based implementations. Each batch is first filled with a group of independent queries, and then transferred to and executed on the GPU, i.e., as soon as a query arrives, it is buffered until there are enough queries to fill the current batch of queries before transferring the batch to GPU for processing by the query kernels. Note that in practice, not all the queries are processed simultaneously, but rather GPU's scheduler decides when to process which query. As CPUs support less parallelism compared to GPUs and the additional memory accesses to the CPUs may deteriorate the batch processing performance of GPUs, in our implementation, all d arrays are stored on the GPU to ensure that the operations to access the arrays are executed completely within the GPU.
2) Multi-Stream Pipeline: As discussed earlier, batch processing is required to take maximum advantage of the massive parallelization that GPU enables. However, waiting for enough queries to fill a batch before sending the batch to GPU results in unnecessary delays. Furthermore, while a large batch does boost the throughput of the GPU, it increases the waiting time before a batch fills and is transferred to GPU for processing. This means that the query that arrived at the start of the current batch will experience significant latency before it is processed. To resolve this throughput-latency dilemma, we utilize the multi-stream technique featured in NVIDIA Fermi GPU architecture [26] , [31] . A stream, in this context, is a sequence of operations that must be executed in a certain order.
In the CUDA architecture, data transfers and kernel executions in different streams can run concurrently as long as the device supports concurrent operations and the host memory used to exchange data between the CPU and the GPU is page-locked. In this way, while one stream is copying data between the CPU and the GPU, another stream can execute query kernels in parallel. As a result, the streams behave as a multi-stage pipeline and reduce the total processing time.
Furthermore, a large batch can be divided into several smaller ones, reducing the average lookup latency while keeping the throughput high. Given a batch of requests and a sequence of active streams, the task mapping should be as balanced as possible to use the GPU's parallelism efficiently. Let $b$ and $n$ denote the batch size and the number of active streams, respectively. If $b$ is an exact multiple of $n$, the whole batch can be evenly divided. Otherwise, the first $b \bmod n$ streams each need to process one extra request. In our implementation, after dividing the whole batch into multiple smaller batches of approximately identical sizes, we use their offset and size information for task mapping. Each small batch is processed by its assigned stream, and all streams are launched one after the other to work as a multi-stage pipeline.
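The balanced task mapping described above can be sketched as follows. This is a minimal Python illustration, assuming a hypothetical helper `split_batch` that returns one (offset, size) pair per stream; it shows only the arithmetic of the mapping and does not launch any GPU work.

```python
def split_batch(b: int, n: int):
    """Divide a batch of b requests among n streams as evenly as possible.

    Returns a list of (offset, size) pairs, one per stream. The first
    b % n streams receive one extra request.
    """
    base, extra = divmod(b, n)
    tasks = []
    offset = 0
    for stream in range(n):
        size = base + (1 if stream < extra else 0)
        tasks.append((offset, size))
        offset += size
    return tasks


# A batch of 10 requests over 4 streams: the first 10 % 4 = 2 streams get 3 requests.
tasks = split_batch(10, 4)
```

Each stream would then copy and process the slice `[offset, offset + size)` of the batch, so the sizes differ by at most one request.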
V. EXPERIMENTAL RESULTS
We conducted extensive experiments to evaluate the performance of our SF$_F$-sketch in terms of accuracy and speed. From here on, we refer to the SF$_F$-sketch simply as the SF-sketch. For comparison, we also implemented and evaluated the performance of three well-known sketches, namely the Count-sketch (C-sketch) [18], the CM-sketch [19], and the CU-sketch [20], and one recently proposed sketch, namely the CML-sketch [21]. The CML-sketch and the CU-sketch do NOT support deletions. The CML-sketch and the Count-sketch suffer from both over-estimation and under-estimation errors.
A. Experimental Setup
Datasets: We use three types of datasets: real world traffic, a uniform dataset, and a skewed dataset. The real world network traffic trace was captured at the main gateway of our campus, while the uniform and the skewed datasets were generated by the well-known YCSB [27]. We keep the skewness of our skewed dataset equal to the default value for YCSB, which is 0.99. We use Memcached [32] to record the real frequency of each item to establish the ground truth.
Experimental Comparison: To give an advantage to the state-of-the-art sketches, we store them in SRAM. For our SF$_F$-sketch, however, we store the Slim-subsketch in SRAM and the Fat-subsketch in DRAM. Although the DRAM Fat-subsketch is an overhead of our scheme, DRAM is cheap and large, so this overhead is acceptable and reasonable in practice. In our experiments, we allocate the same amount of SRAM to the state-of-the-art sketches and to the Slim-subsketch. For update experiments, we compare them by varying item frequencies and operation size, i.e., the number of insertion and deletion operations.
B. Experiments on Accuracy
We use relative error (RE) to quantify the accuracy of sketches. Let $f_e$ represent the actual frequency of an item $e$ and let $\hat{f}_e$ represent the estimate of the frequency returned by the sketch; the relative error is defined as the ratio $|f_e - \hat{f}_e| / f_e$. To evaluate accuracy, we used 100K distinct items and a fixed parameter setting ($d = 5$, $w = 40000$, $z = 3$). We calculated relative errors for different sketches in three settings: (1) by incrementally increasing the number of insertion operations; (2) by incrementally increasing the number of deletion operations; and (3) by first increasing the number of insertion operations and then deleting the inserted items one by one in reverse order. We performed experiments in these three settings for both uniform and skewed workloads. We also conducted experiments to quantify the effect of system parameters on the performance of the sketches. In all our experiments on accuracy evaluation, we use 100K ($= 100 \times 10^3$) distinct items in total.
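The relative error metric defined above can be computed as in the following Python sketch. The function names and the dictionary-based representation of the ground truth and the estimates are assumptions made for illustration.

```python
def relative_error(actual: int, estimate: int) -> float:
    """Relative error |f_e - f_hat_e| / f_e of a single item (actual > 0)."""
    return abs(actual - estimate) / actual


def average_relative_error(truth: dict, estimates: dict) -> float:
    """Mean relative error over all items in the ground truth."""
    total = sum(relative_error(f, estimates[item]) for item, f in truth.items())
    return total / len(truth)


# Toy example: item "a" is over-estimated by 1, item "b" is exact.
truth = {"a": 10, "b": 4}
estimates = {"a": 11, "b": 4}
are = average_relative_error(truth, estimates)
```

In this toy example the two relative errors are 0.1 and 0, so the average relative error is 0.05.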
1) Uniform Workload: Relative Error CDF: Our experimental results show that the percentage of items for which the relative error of our SF-sketch is less than 1% is 74.51%, which is 18.8, 4.3, 2.1 and 1.9 times higher than the corresponding percentages for the CML, C, CM and CU-sketches, respectively. Figure 5 reports the empirical cumulative distribution function (CDF) of relative error for the 100K distinct items after a total of 10M ($= 10 \times 10^6$) insertions. Specifically, we first inserted the 100K distinct items a total of 10M times such that each item was equally likely to occur, and then calculated the relative errors in the frequency estimates of those 100K distinct items. This gave us 100K values of relative error for each of the five sketches (CML, C, CM, CU and SF-sketches), from which we plotted a CDF for each sketch. We observe from Figure 5 that the CDF of the SF-sketch is not only higher than that of the other four sketches but also ascends sharply near a relative error of 0. This indicates that the relative error in the frequency estimates of most items, calculated from the SF-sketch, is very close to 0.
Figure 6 plots the average relative errors in the frequency estimates of the 100K distinct items obtained from the five sketches for different numbers of insertions. We observe from this figure that the average relative errors of the five sketches converge to different fixed values as the number of insertions increases, with our SF-sketch being the most accurate. The converged average relative error of our SF-sketch is 6.2, 24.0, 33.1 and 3.8 times smaller than that of the CML, C, CM and CU-sketches, respectively.
Figure 7 plots the average relative errors in the frequency estimates of the 100K distinct items obtained from the three sketches that support deletions for different numbers of deletions. Before starting the deletions, we inserted the 100K items 10M times for each sketch.
Note that Figure 7 does not include results for CML and CU-sketches because they do not support deletions.
Increase in Error due to Deletions: Our experimental results show that our SF-sketch loses some accuracy due to deletions, while the C and CM-sketches do not. Despite that, the average relative error of the SF-sketch is still 2.1 to 24.1 times lower than that of the C-sketch and 2.4 to 33.4 times lower than that of the CM-sketch. In this experiment, we first inserted the 100K items 10M times, and then deleted them in the reverse order of the insertions. After every 100K insertions, and likewise after every 100K deletions, we calculated the average relative error for the 100K distinct items and plotted it in Figure 8. The lines with hollow squares/triangles/circles indicate average relative errors calculated after insertion operations, while the lines with solid squares/triangles/circles indicate average relative errors calculated after deletion operations. Again, Figure 8 does not include results for the CML and CU-sketches because they do not support deletions. We observe from this figure that the hollow-marker lines for the C and CM-sketches perfectly track the corresponding solid-marker lines, which means that deletion operations do not reduce accuracy in the C and CM-sketches. However, the line with hollow circles for the SF-sketch lies below the corresponding line with solid circles, showing that deletions deteriorate the accuracy of the SF-sketch. The reason is that when deleting an item, the SF-sketch cannot always decrement all counters of the Slim-subsketch that it incremented when inserting that item, because decrementing all those counters can lead to under-estimation error. Thus, the SF-sketch supports deletions at the cost of slightly reduced accuracy after deletions.
2) Skewed Workload: For skewed workloads, we performed exactly the same experiments as for the uniform workloads. The results from these experiments are shown in Figures 9, 10, 11 , and 12. The trends in these figures for the skewed distribution are similar to what we observed for the uniform distribution. Therefore, for the sake of brevity, next, we concisely report the results without describing again how the experiments were conducted.
Relative Error CDF: Our experimental results, reported in Figure 9, show that in the case of the skewed workload, the percentage of items for which the relative error of our SF-sketch is less than 1% is 74.30%, which is 21.3, 4.6, 2.1 and 1.7 times higher than the corresponding percentages for the CML, C, CM and CU-sketches, respectively.
3) Real Traffic: We also used real traffic to evaluate the accuracy of the sketches. As real traffic does not have deletion operations, we only show results for the relative error CDF and for relative error with respect to insertions. We have 10M real traffic instances and regard traffic with the same destination IP address as belonging to the same flow. Using this definition of flow, there are about 230K flows in our data set, and the size distribution of flows is skewed, with an expected value of 8.1 and a variance of 1606. We set $d = 5$, $w = 300000$, $z = 20$ in this set of experiments.
Relative Error CDF: Our experimental results, reported in Figure 13 , show that in case of real world traffic, after a total of 10M insertions of the 230K distinct items, 99.81% items with our SF-sketch have no error, while the percentages of items with CML, C, CM and CU-sketches are 72.26%, 79.28%, 96.21% and 99.06%, respectively.
Relative Error vs. # of Insertions: Our experimental results, reported in Figure 14, show that for our real world traffic, the average relative error of the SF-sketch is 25.4 to 4246.2, 98.0 to 1341.8, 11.8 to 17.3 and 3.5 to 4.8 times smaller than the average relative errors of the CML, C, CM, and CU-sketches, respectively.
4) Sketch Parameters: Next, we evaluate the effect of changing the system parameters $d$ (the number of arrays) and $w$ (the number of buckets per array) on the accuracy of the sketches. In each experiment to evaluate the effect of system parameters, we insert the 100K distinct items 10M times.
Accuracy vs. w: Our experimental results show that the CU-sketch requires 1.5 times more memory compared to the SF-sketch to achieve close to 1% average relative error. Figure 15 plots the average relative error as the number of buckets per array varies with $d$ fixed at 5. We observe from this figure that increasing the number of buckets per array reduces the average relative error for each sketch. At 30K buckets per array, the average relative error of our SF-sketch drops to just 0.047. In contrast, the CU-sketch requires 50K buckets per array to achieve an average relative error of 0.049, and it does not support deletions. The CML, C and CM-sketches did not achieve close to 0 average relative error in our experiment.
Accuracy vs. d: Our experimental results show that our SF-sketch achieves an average relative error of 5.6% using only 3 arrays, whereas the CU-sketch needs 6 arrays to come close to the error of the SF-sketch, achieving an average relative error of 7.1%. This shows that, at this error rate, the SF-sketch takes only half as much SRAM as the CU-sketch. Figure 16 plots the average relative error as $d$ varies with $w$ fixed at 40K. We observe from this figure that using 6 arrays, the SF-sketch achieves an average relative error of 1.9%. We also observe that increasing the number of arrays reduces the average relative error for all the sketches.
C. Experiments on Query Speed
Next, we evaluate the throughput and latency in processing queries. Throughput is quantified as the number of queries processed per second. Latency is measured in microseconds and quantifies the time between submitting a query and receiving the response. We present results from our evaluation of throughput and latency on multi-core CPU as well as on GPU platforms. We observed from our experiments that the throughput of all five sketches (i.e., CML, C, CM, CU and SF-sketches) was almost the same. This observation was expected because the query operation of these sketches is almost the same. For this reason, in the rest of this section, we only present results for the SF-sketch.
1) Multi-core CPU Platform: Our experimental results show that the SF-sketch experienced a throughput gain of about 650K queries per second per thread up to 24 threads. Figure 17 plots the throughput vs. the number of threads for the SF-sketch. For this experiment, we performed 10M queries. We observe from this figure that the SF-sketch achieves a throughput of about 1.34M queries per second with a single thread. Using 24 threads, it achieves a throughput of about 16.3M queries per second. We further observed that increasing the number of threads beyond 24 hardly brought about any improvement because our CPU platform has 6×2 cores, which support 6×2×2 = 24 threads. This suggests that the query speed of our SF-sketch increases linearly with the number of CPU cores.
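The per-thread gain quoted above follows directly from the single-thread and 24-thread throughputs. The short Python sketch below reproduces that arithmetic; the function name is hypothetical and the inputs are the rounded figures reported in the text.

```python
def per_thread_gain(single_qps: float, multi_qps: float, threads: int) -> float:
    """Average throughput added by each thread beyond the first."""
    return (multi_qps - single_qps) / (threads - 1)


single = 1.34e6   # queries per second with 1 thread
multi = 16.3e6    # queries per second with 24 threads

gain = per_thread_gain(single, multi, 24)  # about 650K queries/s per added thread
speedup = multi / single                   # about 12.2x over a single thread
```

The overall speedup of roughly 12 is close to the number of physical cores (6×2 = 12), which is consistent with throughput scaling with the number of cores rather than with the number of hardware threads.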
2) GPU Platform: Our experimental results for three different data sets show that the query speed in GPU increases with the increase in the batch size. As shown in Figure 18 , for the batch size of 20K queries, the query speed is around 50 million queries per second (Mqps). With increase in the batch size, such as 64K queries per batch, SF-sketch reaches a query speed higher than 110 Mqps.
Our experimental results for three different data sets show that, for the SF-sketch, a batch size of 28K is the best choice for keeping latency low in our experimental setup. Figure 19 shows that the average query latency of the SF-sketch is below 410 µs at a batch size of 28K. At a batch size of 32K, the latency increases to 511 ∼ 584 µs.
VI. CONCLUSION
In this paper, we proposed a new sketch, namely the SF-sketch, which achieves up to 33.1 times higher accuracy compared to the CM-sketch while using the same amount of SRAM and keeping the query and update speeds as fast as those of the CM-sketch. The key idea behind our proposed SF-sketch is to maintain two separate sketches, one in the fast memory (SRAM) called the Slim-subsketch and another in the slow memory (DRAM) called the Fat-subsketch. The Slim-subsketch enables the SF-sketch to achieve high query speed, while the Fat-subsketch enables it to achieve high query accuracy. To evaluate and compare the performance of our proposed SF-sketch, we conducted extensive experiments on multi-core CPU and GPU platforms. Our experimental results show that our SF-sketch significantly outperforms the state of the art in terms of accuracy.
