ABSTRACT
The ability to detect heavy hitters in real time is beneficial to many network applications, such as DoS and anomaly detection.
INTRODUCTION
"Heavy hitter" flows, i.e. flows with large traffic volumes, comprise less than 10% of all flows in a data-center network, but carry most of the bytes transmitted in the network [6]. Additionally, more than 80% of flows last less than 11 seconds and carry less than 10 KB of data (just a few packets), while only ≈ 0.1% last longer than 200 s [6, 15].
This has interesting implications for traffic engineering. Quickly distinguishing between these two types of flows on a short time-scale is important for several applications, such as DoS (Denial of Service) and anomaly detection, flow-size aware routing, and Quality of Service (QoS) management. Programmable switches, along with network programming languages such as P4 [7], offer new possibilities to detect heavy hitter flows directly in the data-plane while the packets are being processed. Consequently, specialized actions can be applied to these packets (e.g. providing higher or lower QoS or rerouting to avoid congestion), allowing network operators to respond to short traffic spikes quickly. This way, traffic flows belonging to applications that have very strict latency, jitter, and bandwidth requirements, such as the Tactile Internet, could be easily identified, enabling switches to treat them differently by providing per-packet QoS [22].
Existing data-plane solutions such as HashPipe [20] use memory- and processing-efficient data-structures to count packets. However, they lack a mechanism to remove outdated information from the data-structure and rely on periodic flushing of the switch's memory. As a consequence, flows detected in the previous window are forgotten and need to be detected again each time the structure is flushed, thus decreasing accuracy and increasing detection time. In addition, flushing the memory of counting data-structures leads to inconsistencies, as all memory cannot be flushed simultaneously.
This is especially prominent at switches that process hundreds of millions of packets every second.
A sliding window over the last N packets solves the aforementioned problems by ensuring that only information about the last N packets is present in the switch.
This approach optimizes the detection time, increases accuracy, and needs no special actions from the control-plane (e.g. register flushing) [5]. However, despite these benefits, no efficient practical implementation of a sliding-window heavy hitter algorithm targeting programmable networking devices exists.
Existing sliding window approaches use dynamic memory allocation or complex data-structures such as linked lists. Maintaining these structures requires many read/write actions, while switches with many 10-100GE ports have only a small time budget available if they want to maintain a high processing throughput (up to a few Tbps) [20]. Existing hardware solutions that are optimized for low memory consumption (WCSS [5], Memento [4]) were not developed with P4 and programmable hardware in mind, and generally exceed the available processing budget by using too many memory accesses per processed packet to maintain the window and counting structure.
In this paper, we present a solution for heavy hitter detection using a sliding window approach that is designed and optimized for programmable network hardware by minimizing the processing overhead. That is, we minimize the additional number of cycles spent per packet to execute the heavy hitter algorithm. Additionally, in order to target different programmable hardware, our solution is tunable with respect to memory usage and the number of stages in the switch. By increasing the available memory, the accuracy of our approach can be increased while keeping the processing time constant.
PROBLEM STATEMENT
Detecting heavy hitters is a type of "frequent items" problem. That is, given a stream S and a packet p, the goal is to determine whether more than x% of the last N packets of S belong to the same flow as p. Often, an algorithm that solves this problem does so by keeping track of frequency estimates $\hat{f}$. Two possible errors can occur with such algorithms: (1) false positives, that is, falsely detecting a packet as belonging to a heavy hitter flow, and (2) false negatives, that is, failing to recognize a packet as belonging to a heavy hitter flow. While both of these errors should be minimized, false positives are typically preferred over false negatives, as accidentally ignoring heavy hitter traffic can have a significant impact.
We identify two sub-problems: (1) keeping track of packet counts to determine if packets are heavy hitters (Sec. 3), and (2) tracking the sliding window of N packets by reducing the counts of packets that leave the window (Sec. 4). We give multiple solutions for each of these sub-problems, which can be combined arbitrarily to solve the overall heavy hitter detection problem on programmable hardware.
Hardware Constraints
Switches, especially those at the core or those processing large amounts of data, have to process a large number of unique flows and have only limited hardware resources available. As a result, it is infeasible and not scalable to store and maintain all flow frequencies in the data-plane. In general, compared to heavy hitter detection outside of the data-plane, the amount of memory available is much more limited, severely constraining any heavy hitter detection algorithm.
More importantly, to avoid a drop in throughput, packets need to be processed as fast as they arrive (at line rate), allowing for a processing budget of only a few nanoseconds. As an example, for a 100GE link and packets of size 64B, the processing time per packet needs to be smaller than 6.88ns. On current programmable hardware, memory accesses consume most processing cycles, so these should be limited as much as possible. On some hardware, typically only one read-modify-write action per register array is allowed.
COUNTING SKETCH
To keep track of frequency estimates and identify heavy hitters, we make use of sketches. Sketches are compact data structures that can be used to efficiently store large amounts of data. Instead of storing all data, they store only a summary of the data. This way, they trade accuracy for memory. Typically, these data structures are probabilistic and make liberal use of hashing.
Sketches are usually optimized for low memory consumption, but do not track the flow identifiers of packets. However, as our goal is to identify heavy hitter packets while they are processed, storing flow identifiers is not needed.
A hash table as the main building block
Hash tables guarantee constant query and update time, have a fixed memory footprint, and are supported by all programmable hardware. Thus, they are ideal as main building blocks for a counting sketch.
As the number of unique flows k will often be significantly larger than the table width, there will be a large number of hash collisions. To keep the update and query time constant (and minimize the number of memory accesses), we do not resolve collisions (by exporting them to the CPU when a collision is detected, as explained in [21]), but simply seek to minimize their number and impact.
We define the load factor $\lambda_i$ of a hash table i of size $\mathit{width}_i$ that stores flow statistics of $k_i$ unique flows as
$$\lambda_i = \frac{k_i}{\mathit{width}_i}.$$
This variable describes how full the table currently is. For example, a hash table with load factor 0.25 is 25% "full".
If the hash function is uniform, the number of flows $X_j$ that are mapped to a single table entry j follows the binomial distribution $B(k_i, 1/\mathit{width}_i)$. Now, the number of collisions of flow entry j, $C_j$, is
Thus, the expected number of collisions of each table entry is
As collisions directly scale with $k_i$, table widths should scale with the number of processed unique flows $k_i$ instead of the number of packets in the window N. The load factor $\lambda_i$ directly influences the probability of false positives. For every heavy hitter flow that is detected, on average, approximately $\lambda_i$ additional small flows are falsely identified. Additionally, multiple smaller flows can either cause the heavy hitter flow to be detected prematurely or add up together to the defined heavy hitter threshold. By adding more memory to our counting hash tables, or by using several of them, better accuracy can be achieved and the probability of false positives reduced (see Sec. 3.2 and Sec. 3.3).
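The relationship between the load factor and the number of colliding flows can be checked empirically. The following Python sketch is illustrative only: the helper names (`table_index`, `load_factor`, `mean_colliding_flows`) and the use of CRC32 with a prepended seed are assumptions made for this example, not part of the implementation described here. It hashes a set of unique flow identifiers into one table and reports how many other flows share a flow's entry on average.

```python
import zlib


def table_index(flow_id: str, seed: int, width: int) -> int:
    """Map a flow identifier to a table index (CRC32 is an assumed stand-in hash)."""
    return zlib.crc32(f"{seed}:{flow_id}".encode()) % width


def load_factor(num_flows: int, width: int) -> float:
    """Load factor of a table: unique flows per table entry."""
    return num_flows / width


def mean_colliding_flows(flow_ids: list, seed: int, width: int) -> float:
    """Average number of other flows that share a flow's table entry.

    Assumes that flow_ids contains unique identifiers.
    """
    occupancy = {}
    for flow_id in flow_ids:
        idx = table_index(flow_id, seed, width)
        occupancy[idx] = occupancy.get(idx, 0) + 1
    # Each of the n flows in an entry collides with the n - 1 others.
    return sum(n * (n - 1) for n in occupancy.values()) / len(flow_ids)


flows = [f"flow-{i}" for i in range(4096)]
print(load_factor(len(flows), 1024))                    # 4.0
print(mean_colliding_flows(flows, seed=0, width=1024))  # compare with the load factor
```

For a reasonably uniform hash, the second value should be close to the load factor, which matches the observation that roughly $\lambda_i$ small flows share the entry of each heavy hitter.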
Count-Min sketch
As a first approach, we implement the Count-Min sketch [11] in P4. The Count-Min sketch is a probabilistic data structure for storing frequencies. It consumes very little memory, but this comes at the cost of potentially overestimating frequencies.
The Count-Min sketch stores frequencies in a two-dimensional array (multiple hash tables).
The width of each table is smaller than the total number of unique flows, so flow identifiers are hashed to generate an index. To reduce the effect of collisions, the frequency of each flow is simultaneously maintained in multiple tables, each of which is indexed by a different hash function, as shown in Fig. 1. The frequency is obtained by taking the minimum of these values. As a flow will collide with different flows in each hash table, the probability of falsely identifying a flow as a heavy hitter is reduced. For a perfect window, if the width of the sketch is set to $\lceil e/\epsilon \rceil$ and the depth to $\lceil \ln(1/\sigma) \rceil$, the probability that the estimated frequency $\hat{f}$ is smaller than or equal to $f + \epsilon \cdot N$ is at least $1 - \sigma$ [11]. Thus, by increasing the width we can decrease the overestimation error bound $\epsilon$, while by increasing the depth we can increase the probability of staying within that error bound.
P4 implementation. When implementing the Count-Min sketch in P4, one register array (to store the flow counts) and two match-action tables are needed per depth (except for d = 1).
The first match-action table updates the flow counts of the corresponding register array and makes sure that this count is always between 0 and N. To calculate the index for this register array, we first calculate a hash of the packet identifier and then perform a modulo operation on that hash using the size of the register array. The second match-action table is used to determine the minimum between two flow counts from two successive register arrays. This way, as the packet passes through each table, the current minimum is always saved in a metadata variable that is compared against the heavy hitter threshold at the end. Finally, if the minimum count exceeds this heavy hitter threshold, a metadata variable indicating this is set to 1 using a separate match-action table.
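The update path described above can be mirrored in a few lines of Python. This is a behavioural sketch rather than the P4 program itself: the class name `CountMinSketch`, the helper `dimensions`, and the CRC32 hash with a per-row prefix are assumptions chosen for illustration. Counts are clamped to the range [0, N], and the running minimum plays the role of the metadata variable.

```python
import math
import zlib


class CountMinSketch:
    def __init__(self, width: int, depth: int, max_count: int):
        self.width = width
        self.depth = depth
        self.max_count = max_count  # counts are kept between 0 and N
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, flow_id: str, row: int) -> int:
        # Hash of the flow identifier, then modulo the register array size.
        return zlib.crc32(f"{row}:{flow_id}".encode()) % self.width

    def update(self, flow_id: str) -> int:
        """Increment one counter per row and return the running minimum."""
        estimate = self.max_count
        for row in range(self.depth):
            idx = self._index(flow_id, row)
            self.rows[row][idx] = min(self.rows[row][idx] + 1, self.max_count)
            estimate = min(estimate, self.rows[row][idx])
        return estimate

    def query(self, flow_id: str) -> int:
        return min(self.rows[r][self._index(flow_id, r)] for r in range(self.depth))


def dimensions(epsilon: float, sigma: float) -> tuple:
    """Width and depth for error bound epsilon and failure probability sigma."""
    return math.ceil(math.e / epsilon), math.ceil(math.log(1 / sigma))


sketch = CountMinSketch(width=4096, depth=3, max_count=65536)
threshold = 65  # 0.1% of a window of 65536 packets
flags = [sketch.update("10.0.0.1:80") > threshold for _ in range(70)]
print(flags.index(True))        # 65, i.e. the 66th packet exceeds the threshold
print(dimensions(0.001, 0.05))  # (2719, 3)
```

Because every row is incremented for every packet, the estimate can only overestimate the true count, never underestimate it.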
Memory consumption. The total memory consumption of the data-structure presented in the previous subsection can be calculated as:
Gated Sketch
To use the switch's memory more efficiently, we have developed a new sketch. It uses a set of hash tables of different widths but, unlike the Count-Min sketch, does not update every hash table for each processed packet. When a new packet arrives, counters from the hash tables are compared against a set of thresholds $th_0$ to $th_{d-1}$ whose sum equals the heavy hitter threshold th:
$$\sum_{i=0}^{d-1} th_i = th,$$
where $th_i$ is the threshold of hash table i. The packet is only processed by the next table i + 1 if the counter value of the current table i is higher than its threshold $th_i$ (see Fig. 2). If the counters from all hash tables satisfy their respective thresholds, the packet is identified as a heavy hitter. This approach has multiple advantages. First, the number of collisions at deeper hash tables is reduced, as fewer packets are processed by them. As a consequence, the width of the deeper tables can be reduced without losing much accuracy.
This reduces the overall memory consumption and makes it possible to trade off the width of the deeper tables for the width of the first table. Fig. 3 shows the average number of flows processed by the second stage depending on the width of the first stage ($\mathit{width}_0$) and the threshold of the first stage ($th_0$), calculated using CAIDA traces from 2016 collected on an ISP backbone router [1]. If we choose $th_0 = 0$, all packets are processed by the deeper stages (as in the Count-Min sketch). By increasing the value of $th_0$, the number of unique flows processed by the second stage drops significantly. Similarly, by increasing the value of $\mathit{width}_0$, the number of flows that pass to the second stage due to collisions decreases. Second, by using a Gated sketch the average processing time per packet can be reduced, as many packets are not processed at deeper tables.
Finally, the P4 implementation is simpler and uses the switch resources much more efficiently than the Count-Min sketch, as explained below.
P4 implementation. When implementing the Gated sketch in P4, only one register array and one match-action table (to maintain the flow counts) are needed per depth. The match-action table is used to update the flow count in the corresponding register array. Additionally, before a match-action table is applied in the ingress control block, we check whether the count from the previous table satisfies its respective threshold.
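The gating logic described above can be sketched in Python as follows. The class name `GatedSketch` and the CRC32 hash with a per-stage prefix are assumptions for illustration, and the sketch models behaviour only, not the P4 tables. A packet increments the counter of a stage and proceeds to the next stage only if that counter is higher than the stage threshold.

```python
import zlib


class GatedSketch:
    def __init__(self, widths: list, thresholds: list):
        if len(widths) != len(thresholds):
            raise ValueError("one threshold is needed per hash table")
        self.widths = widths
        self.thresholds = thresholds  # th_0 .. th_{d-1}, summing to th
        self.tables = [[0] * w for w in widths]

    def update(self, flow_id: str) -> bool:
        """Process one packet; return True if it is flagged as a heavy hitter."""
        for stage, (width, th) in enumerate(zip(self.widths, self.thresholds)):
            idx = zlib.crc32(f"{stage}:{flow_id}".encode()) % width
            self.tables[stage][idx] += 1
            if self.tables[stage][idx] <= th:
                return False  # gate closed: deeper tables are not touched
        return True


# Two stages with th_0 = 60 and th_1 = 5, so th = 65 packets in total.
sketch = GatedSketch(widths=[4096, 2048], thresholds=[60, 5])
flags = [sketch.update("10.0.0.1:80") for _ in range(70)]
print(flags.index(True))      # 65, i.e. the 66th packet is the first one flagged
print(sum(sketch.tables[1]))  # 10: only packets 61..70 reached the second table
```

In this run only 10 of the 70 packets touch the second table, which is the effect that allows the deeper tables to be narrower.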
Gating the tables in this way significantly simplifies the design compared to the Count-Min sketch and reduces the number of tables needed by a factor of 2.
Memory consumption. The total memory consumption of the data-structure presented in the previous subsection can be calculated as:
SLIDING WINDOW
The sketching approaches described in the previous section count all packets received by the node since it was started (or since the register values were reset). However, only recent packets in the stream are relevant and represent the current state of the network.
If register values are reset every N packets, the probability of false negatives at the beginning of the window can be significant. Additionally, resetting all the counts on a switch requires an action from the control-plane. In the case of a heavy hitter sketch running in the data-plane, the state in the switches (e.g. flow counts) changes at line rate (at speeds that can reach Tbps), preventing any software-based controller from consistently resetting all the used register arrays. Additionally, small window sizes, such as $2^{16}$ packets, correspond to no more than a fraction of a second on a 100Gbps link. Resetting state from a controller on such a short time-scale is ineffective, requires too many actions, and is possible only based on time (every T seconds), not on the number of packets.
This problem can be solved with a sliding window over the last N packets. If such a data structure is added to the counting sketch, outdated flows and counts can be removed from the counting sketch.
Ring sliding window
Our first approach to implementing a sliding window is to keep track of the flow identifiers of the last N packets in an array, similarly to the approach described in [2, 3]. Every time a new packet arrives, the oldest entry in the array is removed and replaced with the new flow identifier. Afterwards, the counts for the removed flow are reduced in all hash tables, as shown in Fig. 4. The main advantage of this approach is its high accuracy. All hash tables contain only the counts of the last N packets, and the probability of false negatives is equal to 0. In case of collisions, the frequency of flows can be overestimated, but never underestimated. However, as the ring structure takes up a large amount of memory, it is not practical for larger window sizes. The array of flow identifiers (5-tuples) shown in Fig. 4 takes up 13×N bytes, and the index register, used to store the position of the oldest packet in the ring, an additional $\log_2(N)$ bytes.
Memory consumption. To save memory, it is possible to store just the values $h_i(\mathit{flow\_id})$ for each depth.
Thus, the memory consumption of the presented structure is equal to:
Since programmable switches have limited memory (typically 1.4MB per stage [20]), the ring structure becomes infeasible for $N \geq 2^{20}$ (between 0.1 and 1.3 s on a 10Gbps link) even if the depth of the counting structure is equal to 1 (Fig. 5). By increasing the depth of the counting structure, the memory doubles.
P4 Implementation. To implement this structure in P4, we need to save either N packet identifiers (5-tuples) or all the indices of the counting register arrays that were increased while the last N packets were processed. In the first case, 5 additional register arrays are needed to store the flow identifier (source and destination IP address, protocol field, and source and destination port). In the second case, d additional register arrays are needed to store the indices for each counting register array. Before a new packet is processed, an additional table is applied to read the flow identifier (or all the register indices) of the oldest packet.
This significantly increases the overhead of the heavy hitter algorithm, since the number of memory accesses per register array (typically one read-modify-write action is available) as well as the total number of register accesses is limited on most programmable switches, which can lead to drops in throughput.
For every incoming packet, two counts need to be modified in each counting register array: (1) the count of the flow of the newest packet is increased and (2) the count of the flow of the oldest packet in the window is decreased. As a consequence, this solution is not feasible on programmable hardware that stores the register values in local memory and limits the number of read-modify-write actions per register. On programmable hardware that uses shared memory (e.g. Netronome) this limitation is not present, but the number of memory accesses is high and can cause a drop in throughput. In addition, as packets are processed in parallel, shared memory can lead to race conditions. As a consequence, the probability of false positives as well as false negatives (which was 0) will increase.
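The two count modifications per packet can be illustrated with a short sequential Python model. It follows the second variant described above (storing the per-table indices of each packet instead of the full 5-tuple); the class name `RingWindowCounter` and the CRC32 hash are assumptions for this example, and the race conditions of parallel hardware are deliberately not modelled.

```python
import zlib


class RingWindowCounter:
    def __init__(self, window: int, width: int, depth: int = 1):
        self.window = window
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]
        self.ring = [None] * window  # table indices of the last N packets
        self.oldest = 0              # position of the oldest packet in the ring

    def _indices(self, flow_id: str) -> tuple:
        return tuple(
            zlib.crc32(f"{row}:{flow_id}".encode()) % self.width
            for row in range(self.depth)
        )

    def update(self, flow_id: str) -> int:
        """Add a packet, expire the oldest one, and return the flow's estimate."""
        expired = self.ring[self.oldest]
        if expired is not None:
            for row, idx in enumerate(expired):
                self.tables[row][idx] -= 1  # (2) oldest packet leaves the window
        indices = self._indices(flow_id)
        for row, idx in enumerate(indices):
            self.tables[row][idx] += 1      # (1) newest packet enters the window
        self.ring[self.oldest] = indices
        self.oldest = (self.oldest + 1) % self.window
        return min(self.tables[row][idx] for row, idx in enumerate(indices))


window = RingWindowCounter(window=100, width=64, depth=2)
for _ in range(50):
    window.update("A")
for _ in range(100):
    window.update("B")
print(window.update("B"))     # 100: the window now holds only packets of flow B
print(sum(window.tables[0]))  # 100: each table always sums to the window size
```

Since each table holds exactly the last N packets, the estimate of a flow can exceed its true count only through collisions and can never fall below it.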
Sequential sliding window
To obtain a solution that is feasible on programmable hardware as well as for larger values of N (in contrast to the previously described Ring window), we have developed a scheme that needs only $\log_2(\mathit{width})$ additional bits to maintain the sliding window. Every time a packet is added to the sketch, we also reduce all counts in a row (determined by a sequential index).
In this scheme, the probability of false positives and false negatives can be significant because, in contrast to the ring implementation, we do not reduce the flow counts of the oldest packet of the sliding window. Moreover, many entries in the tables can be 0 (depending on the width of the table). As these counts cannot be reduced further while one other count is increased, the total number of counts increased per window can be larger than the total number of counts reduced. As these counts are never removed, accuracy decreases over time. Additionally, when a heavy hitter flow is completed, it takes many cycles for that flow to be removed from the tables, causing potential collisions with newer flows and, as a consequence, increasing the probability of false positives. False negatives are possible since an entry can be reduced at the same time as it is increased.
However, its simplicity and the fact that only $\log_2(\mathit{width})$ additional bits of memory are needed make this approach suitable for programmable network hardware.
P4 Implementation. When implementing this scheme in P4, just one additional index needs to be maintained. Every time a match-action table is applied to update the count for the newest packet, one count from the same register array is reduced using this sequential index. Afterwards, the sequential index is incremented by 1 and saved.
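A minimal Python model of this update rule is given below. The class name `SequentialWindowSketch`, the CRC32 hash, and the choice to apply the increment before the decrement are assumptions of this sketch; the description above fixes only that one count per register array is reduced at the sequential index and that counts do not go below 0.

```python
import zlib


class SequentialWindowSketch:
    def __init__(self, width: int, depth: int):
        self.width = width
        self.tables = [[0] * width for _ in range(depth)]
        self.seq = 0  # sequential index, shared by all tables

    def update(self, flow_id: str) -> int:
        """Count the newest packet and decay one entry per table."""
        estimate = None
        for row, table in enumerate(self.tables):
            idx = zlib.crc32(f"{row}:{flow_id}".encode()) % self.width
            table[idx] += 1
            # Reduce the count at the sequential index, never below zero.
            if table[self.seq] > 0:
                table[self.seq] -= 1
            estimate = table[idx] if estimate is None else min(estimate, table[idx])
        self.seq = (self.seq + 1) % self.width
        return estimate


sketch = SequentialWindowSketch(width=8, depth=1)
for _ in range(20):
    sketch.update("A")
print(sketch.seq)             # 4: the index advanced 20 times modulo 8
print(sum(sketch.tables[0]))  # 17 or 18, depending on which entry flow A hashes to
```

The example shows the imbalance discussed above: most decrements hit entries that are already 0, so far fewer counts are removed than added.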
In addition, this scheme is easily implementable on all programmable hardware, as the number of memory accesses per register array can be reduced to one. To do this, the counting register array needs to be split into two tables. Depending on the value of hash(id) (the hash of the flow identifier of the first received packet), a value from either the first or the second table is decreased. As a consequence, the total memory consumption of this extended data-structure is increased and equal to:
as 2 indices (one for each half of the table) are needed per depth.
Sequential flushing
The main idea of this approach is to reset the counting structure in every window of N packets. For every m-th packet (where m = N/width) that is added to the sketch, we also reduce all counts in a row (determined by a sequential index), as shown in Fig. 8. This way, after N (window) packets are processed, all the register values have been reset to 0. The main advantage of this scheme, in contrast to the Sequential window, is that it can maintain accuracy over time, since the whole structure is reset every N packets. In this scheme, false positives and false negatives will always be present because, in contrast to the ring implementation, we do not reduce the flow counts of the oldest packet of the sliding window.
In addition, counts across columns are inconsistent, as we flush the counts of different flows in each column.
Similarly to the Sequential window, the simplicity and the fact that this solution needs only depth · $\log_2(\mathit{width})$ additional bits of memory make this approach suitable for programmable network hardware. Just like the Sequential window, this solution is implementable on all available P4 hardware using the extension presented in Fig. 9.
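The flushing schedule can be illustrated with the following Python sketch. The class name `SequentialFlushSketch` and the CRC32 hash are assumptions, and the model interprets "reduce" as resetting the entry at the sequential index to 0 in every table, which is what is needed for the whole structure to be cleared after N packets.

```python
import zlib


class SequentialFlushSketch:
    def __init__(self, window: int, width: int, depth: int):
        if window % width != 0:
            raise ValueError("this sketch assumes that width divides the window")
        self.width = width
        self.m = window // width  # one row is flushed every m packets
        self.tables = [[0] * width for _ in range(depth)]
        self.seq = 0              # sequential index of the next row to flush
        self.seen = 0             # packets processed so far

    def update(self, flow_id: str) -> int:
        self.seen += 1
        flush = self.seen % self.m == 0
        estimate = None
        for row, table in enumerate(self.tables):
            idx = zlib.crc32(f"{row}:{flow_id}".encode()) % self.width
            table[idx] += 1
            if flush:
                table[self.seq] = 0  # reset one entry in every table
            estimate = table[idx] if estimate is None else min(estimate, table[idx])
        if flush:
            self.seq = (self.seq + 1) % self.width
        return estimate


sketch = SequentialFlushSketch(window=64, width=8, depth=1)
for _ in range(64):
    sketch.update("A")
print(sketch.seq)             # 0: all 8 entries were flushed once within 64 packets
print(sum(sketch.tables[0]))  # a multiple of 8, at most 56
```

The count that remains for flow A depends on when its entry was last flushed, which illustrates why false negatives are always possible in this scheme.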
Hybrid window
This approach improves upon the previously implemented ring structure (Sec. 4.1). The main disadvantage of the ring approach is its high memory consumption: every flow identifier of the last N packets needs to be stored inside a register array of size N.
To reduce memory usage, we propose a new ring structure that stores a smaller number of identifiers (Fig. 10). Instead of removing packets from the counting sketch as soon as they leave the window, our structure removes packets in batches of th · N/m (the threshold as a percentage times N/m) at a time, similarly to [5]. To keep track of heavy hitters, it adds an additional structure to the sketch, which counts the number of times entries reached th · N/m (shown on the right side of Fig. 10). Every time an entry of the sketch reaches th · N/m, the entry is set to 0 and we increase the count of the right structure by 1. Now, to identify whether a packet is a heavy hitter, we check if this count is larger than or equal to m. To remove a batch of th · N/m packets, we simply reduce the count by 1. To make sure our window is of size N, we reduce the th · N/m count of an entry exactly N packets after we increase it.
This window structure is implemented using two arrays: (1) a flowid array containing flow identifiers of packets that reached th · N/m (third array in Fig. 10) and (2) a bit array of size N specifying when the th · N/m count was increased (second array in Fig. 10). In addition to the counting sketch itself (to count up to th · N/m), this approach requires an additional counting sketch to count the number of times every count reached th · N/m.
In order to implement the two data structures needed to maintain the window (flowid array and the bit array) three additional indices are needed: (1) index1 used to keep track of the current position in the window of size N , (2) first used to keep track of the place in the flowid at which a new packet will be added, and (3) last used to point to the location in flowid that is storing the oldest entry that was added.
If a batch of th · N/m packets needs to be removed (the value of the bit array is 1), a value from flowid is read using the last index. Subsequently, that flowid row is set to 0 and the value of last is incremented by 1 to point to the new oldest item, as shown in Fig. 11. Similarly, if a packet is added to the flowid array, the value of first is incremented by 1 and the value of the flowid row is updated (Fig. 12). A problem with this window is that smaller flows slowly accumulate in the counting sketch, before finally being removed after they hit th · N/m. This increases the number of false positives. To alleviate this problem, we add a smaller pure ring of size N/m to remove packets added to the initial counting sketch (not from the th · N/m counter) after N/m packets. A heavy hitter typically already hits th · N/m during this time, so the accuracy of its count estimate is not affected by much. However, smaller flows are effectively filtered out.
Removing packets from the data-structure. Every time a new packet arrives, we read the value from the bit array pointing to the oldest received packet (N packets before the packet that is currently processed) using an index variable (Index1). Index1 increases by one for every processed packet and always points to the oldest entry in the bit array. If the value in the bit array was 1, the count corresponding to that flow is reduced in the counting hash table (data-structure on the right in Fig. 10), its bit in the bit array is set to 0, and its flow identifier is removed from the flowid array. Additionally, if an additional pure ring is used, a flow needs to be removed from the first counting sketch as described in Sec. 4.1.
Adding new packets to the data-structure. When a new packet arrives, it is added to the initial counting structure (left data-structure in Fig. 10). Every time a flow count of the incoming packet reaches a fraction of the threshold (th · N/m), we set the bit in the bit array to 1 and save the flow identifier in the flowid array. Consequently, we increase the value in the counting hash table (data-structure on the right in Fig. 10) by 1. To approximate the frequency of an item in the window, we check if this flow ever reached N · th/m. If it did, we read the number from the third table and multiply the result by th/m. Otherwise, we conclude that the flow is not a heavy hitter and approximate the frequency with the count present in the first counting table.
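The removal and insertion steps above can be combined into a simplified Python model. Several things here are assumptions made for the sake of a short example: the class name `HybridWindow`, the injectable `index_fn` hash, the use of a FIFO queue in place of the flowid array with its first and last indices, and storing the index into the batch-count table instead of the full flow identifier. The optional pure ring of size N/m and the frequency estimate are omitted.

```python
import zlib
from collections import deque


def crc_index(flow_id, seed: int, width: int) -> int:
    """Assumed stand-in hash: CRC32 over the seeded flow identifier."""
    return zlib.crc32(f"{seed}:{flow_id}".encode()) % width


class HybridWindow:
    def __init__(self, window, batch, m, width1, width2, index_fn=crc_index):
        self.window = window          # N
        self.batch = batch            # th * N / m, expressed in packets
        self.m = m                    # batches needed to reach the threshold
        self.first = [0] * width1     # initial sketch, counts up to one batch
        self.batches = [0] * width2   # number of full batches inside the window
        self.bits = [0] * window      # 1 where a batch was recorded
        self.flowids = deque()        # FIFO standing in for the flowid array
        self.index1 = 0               # current position in the bit array
        self.index_fn = index_fn

    def update(self, flow_id) -> bool:
        pos = self.index1
        # Remove the batch that was recorded exactly N packets ago, if any.
        if self.bits[pos]:
            oldest = self.flowids.popleft()
            self.batches[oldest] -= 1
            self.bits[pos] = 0
        # Add the new packet to the initial counting structure.
        i1 = self.index_fn(flow_id, 0, len(self.first))
        i2 = self.index_fn(flow_id, 1, len(self.batches))
        self.first[i1] += 1
        if self.first[i1] >= self.batch:
            self.first[i1] = 0
            self.batches[i2] += 1
            self.bits[pos] = 1
            self.flowids.append(i2)
        self.index1 = (pos + 1) % self.window
        return self.batches[i2] >= self.m


# N = 1000, th = 10%, m = 4, so one batch is 25 packets.
hybrid = HybridWindow(window=1000, batch=25, m=4, width1=4096, width2=4096)
flags = [hybrid.update("A") for _ in range(100)]
print(flags.index(True))  # 99: the flow is flagged once its fourth batch completes
```

Only one bit per packet and one queue entry per completed batch are kept, instead of one flow identifier per packet as in the pure ring.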
Memory consumption. The total memory consumption of the presented structure is:
$$M_{hybrid} = \mathit{width}_1 \cdot \log_2(N/m) + N + \mathit{width}_2 \cdot \log_2(\mathit{width}_3)$$
The maximum width ($\mathit{width}_2$) of the flowid array can be calculated using the threshold th to detect heavy hitters. When calculating this, we need to consider two consecutive windows of size N (the last N packets that need to be removed and the new N packets that need to be added). In the worst case, if all the counts in the initial sketch have values of N · th/m − 1 at the same time, and reach N · th/m in the next $\mathit{width}_1$ packets, they create $\mathit{width}_1$ packets that are added to the flowid array. The $2N - \mathit{width}_1$ packets left in this window can cause at most $(2N - \mathit{width}_1) \cdot m/(N \cdot th)$ packets to reach N · th/m. The memory consumption of the separate data-structures used by the Hybrid window is shown in Fig. 13. By adding a smaller pure ring structure, the total memory consumption of the first structure is increased by a factor of 100. Thus, this ring is the main contributor to the overall memory consumption of the first structure (Fig. 13a and Fig. 13b). However, by increasing m to keep the ratio N/m constant (e.g. $2^{15}$), the total memory consumption of this pure ring, used with a counting sketch with a width of 8192, will be less than 54 kB.
This number corresponds to just ≈ 3.8% of the memory available per stage on typical programmable hardware (1.4 MB).
The structure used to count the number of th · N/m occurrences consists of a single table, and its memory consumption is similar to the memory consumption of the initial counting structure (Fig. 13a).
The total memory consumption of this data-structure is ≤ 20 kB for all analyzed values of m, width and th (m < 500, width < 8192 and 0.1% ≤ th ≤ 1%).
A comparison of the total memory consumption of the Hybrid ring and the Ring window (Sec. 4.1) is shown in Fig. 14. The biggest contributor to the overall memory consumption is the bit array. Its memory consumption scales with N, and for $N > 2^{23}$ it reaches the hardware limit (1.4 MB).
EVALUATION
Experiment setup
We implemented and evaluated our approaches both on a Netronome SmartNIC and by simulation in Python. The 5-tuple consisting of the source IP, destination IP, layer 4 protocol, source port, and destination port was used as the unique flow identifier. Our Python implementation used the same hash function as the one used by Netronome cards (CRC-CCITT). Different hash functions were created by appending seed values to the flow identifiers.
Traces. We classified heavy hitters as flows whose frequency was above a threshold th that varied between 0.1% and 1%. Packets were obtained from 10 different traces from an ISP backbone link collected at the Equinix data-center in Chicago in January 2016, made available by CAIDA [1]. Each trace is one minute long and contains on average 31 million packets.
Metrics. We evaluated all our presented counting and sliding window solutions on: (1) percentage of false negatives (percentage of packets that were not reported as belonging to a heavy hitter flow but should have been), and (2) false positives (percentage of packets that were reported as belonging to a heavy hitter flow, but should not have been).
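As an illustration of how such per-packet metrics can be computed in a simulation, the following Python sketch derives exact ground-truth flags with a true sliding window and compares them to the flags reported by a detector. The function names `exact_heavy_flags` and `error_rates` are hypothetical, and expressing both percentages relative to the total number of packets is an assumption of this example.

```python
from collections import Counter, deque


def exact_heavy_flags(packets, window, threshold):
    """Ground truth per packet: True if the packet's flow has more than
    `threshold` packets among the last `window` packets (inclusive)."""
    recent = deque()
    counts = Counter()
    flags = []
    for flow_id in packets:
        if len(recent) == window:
            counts[recent.popleft()] -= 1  # oldest packet leaves the window
        recent.append(flow_id)
        counts[flow_id] += 1
        flags.append(counts[flow_id] > threshold)
    return flags


def error_rates(truth, reported):
    """Return (false negative %, false positive %) over all packets."""
    total = len(truth)
    false_neg = sum(1 for t, r in zip(truth, reported) if t and not r)
    false_pos = sum(1 for t, r in zip(truth, reported) if r and not t)
    return 100.0 * false_neg / total, 100.0 * false_pos / total


trace = ["A"] * 5 + ["B"] * 5
truth = exact_heavy_flags(trace, window=4, threshold=2)
print(truth)  # [False, False, True, True, True, False, False, True, True, True]
print(error_rates([True, True, False, False], [True, False, True, False]))  # (25.0, 25.0)
```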
Comparison baselines. We compared our Gated sketch against the Count-Min sketch with the same memory consumption; the Count-Min sketch was chosen as the baseline algorithm. We compared our sliding window approaches to simply resetting all registers to 0 periodically, since we are not aware of any other P4 solution that implements sliding windows.
Counting sketches: Accuracy
Count-Min sketch. The number of false positives mostly depends on the width of the sketch (Fig. 15). Increasing the width of the Count-Min sketch reduces the number of hash collisions and, as a direct result, reduces the count overestimation. False positives also decrease with the depth of the sketch, but not significantly.
Gated sketch. The Gated sketch outperforms the previously implemented Count-Min sketch in both accuracy and memory usage. Its accuracy mostly depends on the thresholds and widths used at each stage. This is especially true for the first threshold ($th_0$) and width ($\mathit{width}_0$), since they have a significant influence on the number of packets processed in the later stages (as can be seen in Fig. 16b).
A higher $th_0$ reduces the number of collisions in deeper stages, increasing accuracy and decreasing the number of false positives (Fig. 16a). Additionally, by increasing $\mathit{width}_0$, the number of packets processed in the deeper stages due to collisions is reduced (as the load factor is reduced, similarly to the Count-Min sketch). As a consequence, only heavy hitter flows and smaller flows that collide with them in the first stage are processed in the deeper stages. Since only a small fraction of packets is processed by the deeper stages, their width can be reduced without losing much accuracy.
This reduces the overall memory consumption, making it possible to trade off the width of the deeper stages for the width of the first stage (Fig. 16c). Additionally, smaller flows that pass through the first stage (due to collisions) are filtered in the deeper stages, since the probability of them colliding with another heavy hitter flow in all deeper stages is reduced.
For example (see Fig. 16a), for a window of size 65536 and a threshold equal to 0.1% (65 packets), the percentage of false positives of a Gated sketch with a $\mathit{width}_0$ of 4096 and a $\mathit{width}_1$ of 2048 varies between ≈ 55% (threshold $th_0$ set to 10) and ≈ 7% (threshold $th_0$ set to 60). At the same time, the Count-Min sketch with a depth of 3 and width of 4096 (thus, with the same number of count entries) does not achieve a percentage of false positives lower than 14%.
Sliding window: Accuracy
Flushing. We tested the accuracy of the counting sketches when the structure was flushed every N packets (Fig. 17), as this is the method most commonly used in the literature ([20]) to clear data-structures.
This method is used as the baseline against which we compare our solutions. If used with a Count-Min sketch with a width of 8192, the probability of false positives is low (less than 1.2%) and decreases as the window size N increases (to 0.6% for N = 2^19). The reason for this is that the threshold used to identify heavy hitters, configured as N · 0.1%, increases with N (e.g., 524 packets for N = 2^19 compared to 32 packets for N = 2^15). However, the probability of false negatives can be significant and depends on the window size N. For lower values of N, the counting structure is reset more frequently and identified heavy hitters are forgotten more often, thereby increasing the probability of false negatives from 4.2% for N = 2^19 to 6.2% for N = 2^15.
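The flushing baseline and its threshold arithmetic can be sketched as follows. This is a simplified illustration under assumptions: the names heavy_hitter_threshold and FlushedCountMin, the BLAKE2b-based hash, and the integer rounding of N · 0.1% are hypothetical choices, not details from the evaluated implementation.

```python
import hashlib


def heavy_hitter_threshold(window, per_mille=1):
    """Packets per window needed to be a heavy hitter (1 per mille = 0.1%)."""
    return window * per_mille // 1000


class FlushedCountMin:
    """Count-Min sketch that is cleared after every `window` packets."""

    def __init__(self, width, depth, window, per_mille=1):
        self.width, self.depth, self.window = width, depth, window
        self.threshold = heavy_hitter_threshold(window, per_mille)
        self.rows = [[0] * width for _ in range(depth)]
        self.seen = 0  # packets counted since the last flush

    def _index(self, row, key):
        # Hypothetical hash choice: BLAKE2b salted with the row number.
        digest = hashlib.blake2b(key.encode(), digest_size=8,
                                 salt=bytes([row])).digest()
        return int.from_bytes(digest, "big") % self.width

    def update(self, key):
        """Count one packet; return True if its flow looks like a heavy hitter."""
        if self.seen == self.window:
            # Flush: every count, including known heavy hitters, is lost.
            self.rows = [[0] * self.width for _ in range(self.depth)]
            self.seen = 0
        self.seen += 1
        counts = []
        for row in range(self.depth):
            idx = self._index(row, key)
            self.rows[row][idx] += 1
            counts.append(self.rows[row][idx])
        return min(counts) >= self.threshold


# A single flow sends 10010 packets; the window is 10000 (threshold 10).
cms = FlushedCountMin(width=8192, depth=3, window=10000)
flags = [cms.update("hh") for _ in range(10010)]
```

In this run the flow is flagged from its tenth packet onwards, is forgotten at the flush, and has to send another ten packets before it is flagged again. Those unflagged packets after each flush are the false negatives discussed above.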
Figure 17: Comparison of the probability of false negatives for the Count-Min sketch with width = 8192, depth = 3 and th = 0.1%, which is reset every N packets. Confidence interval is equal to 95%.
Ring window. The ring window has the best accuracy among all the analyzed solutions. The probability of false negatives is equal to 0 for all the analyzed values of width, N, and th. This is expected as the counting sketch (e.g. Count-Min sketch) only stores the values of the last N packets. Thus, counts can only be overestimated, and never underestimated. The probability of false positives is mostly influenced by the width of the sketch (Fig. 18c). A larger width decreases the number of hash collisions, and with it the number of false positives. Similarly, increased depth reduces the number of false positives. However, the influence of depth is less significant than that of the width.
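The property that a ring window never underestimates can be illustrated with a short Python sketch. It assumes that the ring stores the key of each of the last N packets and that the oldest packet's counters are decremented when a new packet arrives. The class name RingWindowCountMin and the BLAKE2b-based hash are hypothetical.

```python
from collections import deque
import hashlib


class RingWindowCountMin:
    """Count-Min sketch that only holds the last `window` packets."""

    def __init__(self, width, depth, window):
        self.width, self.depth, self.window = width, depth, window
        self.rows = [[0] * width for _ in range(depth)]
        self.ring = deque()  # keys of the packets currently in the window

    def _cells(self, key):
        # Hypothetical hash choice: BLAKE2b salted with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(key.encode(), digest_size=8,
                                     salt=bytes([row])).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def update(self, key):
        """Slide the window by one packet and return the flow's estimate."""
        if len(self.ring) == self.window:
            # Remove exactly the counts added by the oldest packet.
            oldest = self.ring.popleft()
            for row, idx in self._cells(oldest):
                self.rows[row][idx] -= 1
        self.ring.append(key)
        counts = []
        for row, idx in self._cells(key):
            self.rows[row][idx] += 1
            counts.append(self.rows[row][idx])
        return min(counts)


win = RingWindowCountMin(width=1024, depth=3, window=100)
for _ in range(100):
    win.update("a")
for _ in range(60):
    win.update("b")
estimate = win.update("a")  # 40 of the last 100 packets belong to "a"
```

Because each decrement undoes exactly one earlier increment, every row always sums to the number of packets in the window. The estimate for a flow can exceed its true in-window count through hash collisions, but it cannot fall below it.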
Sequential window. This solution performs worst of all analyzed solutions. In contrast to the ring implementation, many entries in the tables can be 0 (depending on the width of the table), causing the total number of counts added per window to be larger than the total number of counts removed. As these excess counts are never removed, accuracy decreases over time, resulting in a significant number of false positives (between 40% and 55%) even for large width values. Moreover, the probability of false negatives is not equal to 0 (unlike the ring window). This is because the entries that are reduced by the sequential window do not necessarily correspond to the oldest packet of the window. However, the percentage of false negatives (≤ 2%) is significantly lower than the percentage of false positives.
Sequential flushing. The probability of false positives and false negatives of the sequential flushing approach predominantly depends on two parameters: (1) the depth and (2) the width of the counting sketch.
By increasing the depth, the percentage of false positives decreases (Fig. 19a, Fig. 19b and Fig. 19c). However, the percentage of false negatives increases (Fig. 19d, Fig. 19e and Fig. 19f) and is higher than for all the other evaluated windowing approaches.
This is expected, as with more stages it is more likely that one of the counts associated with the incoming packet (one per stage) has been reset to zero. Since we use the Count-Min sketch in our experiments, the calculated minimum then falls below the threshold (th · N). When an entry belonging to a heavy hitter flow is flushed, it will take at least N · th packets to reach the heavy hitter threshold again. In the meantime, the heavy hitter flow will not be identified as such.
By decreasing the width, the number of false negatives is reduced. The load factor per hash table is higher, and more collisions occur, reducing the time needed for the counts to reach their previous value (Fig. 19f).
Hybrid window. We have analyzed two different versions of the Hybrid Window: (1) without the initial ring data-structure of size N /k and (2) with the initial ring data-structure.
The first solution (without the initial ring data-structure) has lower accuracy than the analyzed ring window, but outperforms all the other analyzed solutions. The increased number of false positives is due to the fact that smaller flows slowly accumulate in the counting sketch, before finally being removed after they hit th · N /m. The probability of false positives decreases as th, N and width increase, similarly to the ring window (Fig. 20a, Fig. 20b and Fig. 20c).
However, the advantage of this solution is that the probability of false negatives is equal to 0. This is expected since the oldest batch of th · N /k packets is always removed from the second counting structure (that counts the number of th · N /k occurrences). As a consequence, the total count can only be overestimated, and never underestimated.
To reduce the number of false positives, a smaller pure ring of size N /m can be added to remove packets added to the initial counting sketch (not from the th · N /m counter) after N /m packets (second evaluated solution). The accuracy, with this initial ring data-structure, is comparable to the previously analyzed ring window. For a width of 8192, the probability of false positives is under 2% for all the analyzed values of N and th (Fig. 20a and Fig. 20b). Similarly to the ring window, the probability of false positives is reduced with the increase of the width (Fig. 20c).
However, in contrast to the ring window, the probability of false negatives is no longer equal to 0%. False negatives occur when the initial ring window reduces the count for a heavy hitter flow after N /k new packets, while the count remained under th · N /k and was never added to the second hash table (that counts the number of batches th · N /k). However, this is corrected within the next few packets belonging to that flow, and the probability of false negatives is smaller than 0.6% for all the analyzed values of th, width and N.
Influence of parallel processing
We evaluated our solutions using a Python program as well as by implementing the algorithm in P4 and testing it on a Netronome SmartNIC. We verified that our P4 code produced the same results as our Python implementation using multiple artificially generated packet traces by ensuring that the hash tables were identical in both cases at the end of the measurement interval and that the same packets were identified as belonging to a heavy hitter flow.
Afterwards, we evaluated our solutions using the CAIDA traces by generating packets at higher rates to evaluate the influence of parallel processing. Netronome SmartNICs process all incoming packets in parallel (60 microengines on Netronome cards in our testbed), leading to race conditions in cases when multiple microengines try to access the same register memory. The accuracy is decreased, and the probability of false positives and false negatives is increased. However, this difference is not significant (less than 0.5%).
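The kind of race condition meant here can be illustrated with a deterministic Python replay of two workers that each perform a non-atomic read/modify/write on the same register. This is a simplified model, not the SmartNIC's actual memory semantics. The function name replay_increments and the schedule format are hypothetical.

```python
def replay_increments(schedule, register=0):
    """Replay (worker, step) pairs; each worker reads, then writes value + 1."""
    local = {}  # value each worker saw when it last read the register
    for worker, step in schedule:
        if step == "read":
            local[worker] = register
        elif step == "write":
            register = local[worker] + 1
    return register


# Worker 0 finishes before worker 1 starts: both increments are kept.
serial = [(0, "read"), (0, "write"), (1, "read"), (1, "write")]
# Both workers read the old value before either writes: one increment is lost.
racy = [(0, "read"), (1, "read"), (0, "write"), (1, "write")]

serial_result = replay_increments(serial)  # 2
racy_result = replay_increments(racy)      # 1
```

In this model a lost increment leaves the counter one packet too low. Such lost updates are one way in which the counts on parallel hardware can drift from those of a sequential implementation.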
RELATED WORK
Calculating frequent items in a data stream is a well-researched problem and many algorithms have been proposed over the years. We can divide them into three main groups: (1) sampling algorithms, (2) algorithms based on sketches, and (3) counting algorithms.
Sampling algorithms (NetFlow [10], sFlow [19], Sample&Hold [14]) are currently widely deployed and used by network operators, but have some well-known limitations, like the trade-off between scalability, overhead, and accuracy. In these algorithms, nodes usually maintain current flow statistics that are periodically sent to a remote collecting point that performs detailed analysis. However, especially in core routers that process huge amounts of data, they can cause significant bandwidth, CPU, and memory overhead if the sampling rate is not set high enough [9]. Several modifications that address these problems were proposed in [16, 25].
Algorithms based on sketches usually have lower memory requirements and can reduce processing time per packet, while still processing every packet in a large stream of packets. However, they reduce accuracy, causing potential overestimation or underestimation of the flow frequencies. Additionally, many of these algorithms ([13]) cannot be easily and efficiently implemented on programmable switches (using languages such as P4 [7]) and require specialized hardware.
Counting algorithms (HashPipe [20], Space-Saving Algorithm [18], CSS [5]) maintain a data-structure consisting only of heavy hitter flows and corresponding counts. The Space-Saving algorithm is considered state-of-the-art in this group of algorithms as it has the lowest memory usage possible (O(k)) for a fixed accuracy among deterministic heavy hitter algorithms [5, 20]. It uses very simple actions (additions or subtractions), but requires either maintaining a sorted list or finding an item with the minimum counter value among all possible entries in the table. Unfortunately, both options are either not supported by existing programmable hardware or exceed the available processing budget. CSS improves the Space-Saving algorithm by using only statically allocated memory and by supporting constant time point queries [5]. However, data-structures such as the TinyTable [13] were not developed with P4 in mind and cannot be efficiently maintained within the available time budget. HashPipe [20] uses a set of hash tables to count every packet received by the switch. It was designed for P4 and, as such, has very low processing overhead and memory consumption.
CONCLUSION
To avoid drops in throughput, programmable networking hardware comes with a set of specific constraints that need to be taken into account when designing new switch applications, such as a limited number of memory accesses as well as a very limited amount of memory to store stateful information. Most of the existing solutions to detect heavy hitters focus only on low memory overhead and do not take into account the limited number of memory accesses (typically just one read/modify/write per data-structure). To satisfy these new constraints, newer approaches that maintain accuracy over time while having a low processing overhead and low memory consumption are needed.
By analyzing the Count-Min sketch, we realized that the memory used by the sketch could be distributed more optimally. All the packets were processed through a set of hash tables of the same size, leading to many collisions in all of them. This conclusion led to the development of our own approach, the Gated sketch. By using a set of hash tables, whose size decreases with the depth, and a set of thresholds, smaller flows are filtered in the first few tables. At the same time, as more memory is added to the first stages, the number of collisions is decreased. We have shown that our Gated sketch outperforms the Count-Min sketch with the same memory usage, lowering the number of false positives by a factor of 2.
Secondly, we focused on maintaining a high accuracy of the counting sketch over time without any intervention of the control plane by using a sliding window implementation. We showed that the current approach (with the controller intervention), to flush the counting data-structure every T seconds, has many drawbacks. For higher values of T, the percentage of false positives increased significantly (up to 80% of all packets processed by the switch). Additionally, on switches with high processing overhead, this approach required too many controller interventions, making it infeasible for smaller values of N (window size). To counteract these issues, we designed and developed multiple sliding window solutions and evaluated their accuracy for different values of N and the heavy hitter threshold. We have shown that it is possible to maintain a high accuracy over time (Sequential Flushing and Hybrid Window) while taking into account the previously mentioned hardware requirements.
