ABSTRACT Sketch-based data streaming algorithms are used in many network traffic monitoring applications to obtain accurate estimates of traffic flow statistics. However, current implementations of sketch-based methods in network monitoring are too application-specific. Their flexibility is limited because the hardware implementation of the sketch data structure in the network device depends on the measurement task. The sketch counters summarize only the flow statistics that are relevant to a specific measurement task and cannot be reused for different measurement tasks. In this paper, we propose FlexSketchMon, a system designed to provide the flexibility of using various sketch-based algorithms for traffic monitoring and measurement tasks. FlexSketchMon leverages a novel data plane architecture that collects traffic flow statistics and provides arbitrary flow aggregations to the monitoring applications. The data plane design comprises a flow counter table and a flow key table for storing flow-level data. FlexSketchMon is implemented on the NetFPGA-SUME platform and is capable of processing network traffic at line rate in the worst-case scenario of a 64-byte minimum Ethernet frame size. The update of the flow counter table, which is the critical path in the proposed system, can achieve a throughput of 96 Gbps. Simulation results based on a real-world network traffic trace for three monitoring applications (entropy estimation, superspreader detection, and heavy hitter detection) are presented to demonstrate the performance of FlexSketchMon. The results show that FlexSketchMon yields comparable or better measurement accuracy compared to previous approaches.
to obtain traffic measurement results. A sketch is a synopsis data structure typically used in algorithms that process data streams [17] , which are also known as data streaming algorithms [18] . The purpose of sketch or other synopsis data structures is to create a compact summary of a data stream to avoid the impracticality of storing all of the items in the data stream in memory for processing, particularly when the size (or length) of the data stream is very large. Computations can then be applied to this compact summary instead of the original data stream.
The sketch data structure uses an array of hash tables in which each bucket stores a counter [19] , [20] or, an array of bits (bitmap) [21] . The sketch has a small and fixed memory size, of typically up to several kilobytes, making it suitable for implementation in a cache or low latency memory (SRAM). Generally, there are two main operations in a sketch-based algorithm: update and query. The first operation updates the sketch using the key and value of each item in the data stream; the latter is used to obtain the statistics of the data stream (e.g., the most frequent items or the number of distinct items in the data stream) from the sketch following the update operation. Each sketch-based algorithm uses different methods for implementing its update and query operations.
In network monitoring applications, sketch-based algorithms offer the advantage of memory space efficiency. Using only a small amount of memory space, a sketch is capable of summarizing network traffic by updating its counters for each incoming packet without packet sampling. The update operation is the bottleneck in the application of sketch-based algorithms to network traffic monitoring; typically, the sketch data structure is implemented in the SRAM of the network device line card to ensure a line-rate update process. After the update process, a specific traffic measurement task can be performed by processing the sketch using sophisticated algorithms or statistical methods (i.e., the query operation).
Sketch-based data streaming algorithms are efficient and have provable memory-accuracy tradeoffs in traffic flow measurement. However, they are too application-specific and lack the generality for implementation as primitive operations on network devices [22] , [23] . Sketch-based monitoring requires a separate sketch data structure for each flow key, which is determined by the measurement tasks of interest. The current implementations of sketch-based algorithms on network devices are not flexible. Typically, the sketch data structure is implemented in hardware for specific measurement tasks. Other monitoring applications cannot reuse the data structure because it is constructed based on the different flow keys and update values. Despite their advantages in traffic measurement, very little attention has been paid to the development of a generic hardware architecture that allows the flexible usage of sketch algorithms in network traffic monitoring systems. This paper presents FlexSketchMon, a system that supports greater flexibility in the use of different sketch-based algorithms for traffic measurement tasks, to address the aforementioned problem. This is an extension of our preliminary work [24] , in which we described the design of a flow counter table for a flexible sketch-based traffic monitoring system on the older Xilinx Virtex-5 NetFPGA-10G platform. Here, we provide details of the system, flow counter table and flow key table designs, and also the evaluation results using monitoring applications on the newer Xilinx Virtex-7 NetFPGA-SUME platform. FlexSketchMon comprises an FPGA-based hardware data plane and applications that run on the host CPU. The hardware data plane is designed for flow-level data collection without packet sampling and comprises two tables: the flow counter table and the flow key table. During each observation interval, the flow counter table maintains the flows' packet and byte counts while the flow key table stores 5-tuple flow keys. 
At the end of each observation interval, the contents of these tables are read by the host CPU. Flow keys can therefore be selected and processed flexibly, according to the requirements of the various sketch-based monitoring applications. A high-level comparison between FlexSketchMon and the current implementation of sketch-based monitoring is shown in Figure 1. In the current implementation of sketch-based monitoring, each application has its own flow counters that are applicable only to its monitoring task; conversely, in FlexSketchMon the flow counters and keys can be shared among monitoring applications.
The main contributions of this work are summarized as follows:
• A generic architecture for flexible sketch-based network traffic monitoring is proposed. An FPGA-based prototype of the hardware data plane that utilizes two tables and a Bloom filter as its main components for collecting traffic flow statistics is designed and implemented on the NetFPGA-SUME platform [25] . The data plane can process packets at line rate in the worst-case scenario of a 64-byte minimum Ethernet frame size.
• A flow counter table scheme that uses a hash table with linear probing to resolve collisions on the QDRII+ SRAM is designed and implemented. The flow counter table can support up to two million flow entries in an observation interval, with a throughput of 96 Gbps.
• A tabulation-based 5-universal hash function that enables linear probing with a constant probe length per operation is implemented. Using the 5-universal hash function, most of the flows have probe lengths of less than four. With four-word burst operations in QDRII+ SRAM, the majority of flow lookup operations can be done in one memory access. Therefore, the flow counters can be updated in a deterministic time, allowing for line rate update operations.
• A flow key table for storing the 5-tuple flow key on the DDR3 DRAM is designed and implemented. The flow key table can accommodate up to 24 million flow keys during an observation interval.
• The hardware data plane can be used for various monitoring applications without reprogramming or redesigning the data plane for new applications that require measurement of different flow keys. Monitoring applications can flexibly select the flow keys to be used in their measurement tasks.
• Performance of the hardware data plane prototype is evaluated. Monitoring applications are developed to demonstrate that the flow-level data and statistics provided by the hardware data plane can produce accurate measurements for the purposes of entropy estimation, superspreader detection, and heavy hitter detection.
The remainder of this paper is organized as follows. Section 2 presents the background on data streaming algorithms, sketch-based applications for network traffic monitoring, and discusses the related works. Section 3 describes the architecture of FlexSketchMon and presents a software-based simulation of the FlexSketchMon data plane. Additionally, this section also discusses trace-driven simulation results obtained using a real traffic trace. Section 4 describes the hardware implementation of the FlexSketchMon data plane. Section 5 discusses the evaluation of FlexSketchMon and presents the performance test results. Three monitoring application examples are developed and evaluated to demonstrate the measurement accuracy of FlexSketchMon. Comparisons to previous works are also presented. Finally, Section 6 concludes this paper.
II. BACKGROUND AND RELATED WORK
A. DATA STREAM AND STREAMING ALGORITHMS
A data stream $\sigma = (a_1, a_2, \ldots, a_n)$ is a massive sequence of $n$ items, with each item $a_i$ comprising a key $k_i$ and an update $v_i$. Associated with each key is a time-varying signal [9], [26]. Network traffic can be viewed as a data stream in which packets arrive as a series of items. The keys $k_i$ can be defined using any combination of fields from the packet header, and the updates $v_i$ are given in terms of packet count, packet size, or other appropriate application-dependent values.
Streaming algorithms process and perform computations on data streams. In the data stream computation model, the input is presented as items that arrive sequentially with repetitions [26]. Typically, streaming algorithms are constrained both in memory space, which is small relative to the data stream length, and in per-item processing time. Another constraint in high-speed network traffic monitoring applications is that the items (packets) in the data stream (network traffic) must be examined in one pass because it is impossible to store packets in the limited memory space of network devices. Therefore, such algorithms employ synopsis data structures such as the sketch data structure.
B. SKETCH AND NETWORK TRAFFIC MEASUREMENT
A sketch is a data structure that represents a data stream in a compact manner [15]. For example, the Count-Min sketch [20] is a two-dimensional array of counters with $d$ rows and $w$ columns:

$$C[i][j], \quad 1 \le i \le d, \; 1 \le j \le w.$$

Each row $i$ is associated with a hash function $h_i$ that is selected from a universal class of hash functions [27]. When an item $a = (k, v)$ in the data stream arrives, its key $k$ is hashed by the $d$ hash functions and the corresponding counters are updated using the value $v$. The update operation is defined as

$$\forall i \in \{1, \ldots, d\}: \quad C[i][h_i(k)] \leftarrow C[i][h_i(k)] + v.$$
After the update process, the sketch can be queried to obtain statistics about the data stream. The query process typically applies several mathematical or statistical operations to the data collected by the sketch and returns the estimated item (packet) count or the accumulated value (byte count) for key k. For example, querying a key k in the Count-Min sketch [20] returns the minimum value among all of the counters that correspond to that key. Note that each counter in the sketch data structure is shared by all keys hashed to that counter. Therefore, the query result for a key is an estimated value. The estimation error (i.e., how much the estimated value differs from the true value) is determined by the sketch size. The values of d and w must be chosen appropriately to obtain a desirable estimation error. Figure 2 shows the sketch update and query operations. One of the more important properties of the sketch data structure is its linearity, which allows sketches from different observation intervals to be combined using arithmetic operations. Some sketches utilize bitmaps instead of counters [21], [28] and are typically used for estimating the number of distinct items in a data stream; in these sketches, the update operation sets a bit in the bitmap.
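The update and query operations described above can be illustrated with a minimal Python sketch of a Count-Min structure. This is not the implementation evaluated in the paper: the class name, the `seed` parameter, and the simple linear hash family modulo a Mersenne prime are assumptions chosen only to keep the example self-contained.

```python
import random


class CountMinSketch:
    """Illustrative Count-Min sketch: d rows of w counters, one hash per row."""

    _PRIME = (1 << 61) - 1  # assumed modulus for the illustrative hash family

    def __init__(self, d, w, seed=0):
        rng = random.Random(seed)
        self.d, self.w = d, w
        self.rows = [[0] * w for _ in range(d)]
        # One (a, b) pair per row: h_i(k) = ((a * k + b) mod p) mod w
        self.params = [
            (rng.randrange(1, self._PRIME), rng.randrange(self._PRIME))
            for _ in range(d)
        ]

    def _index(self, i, key):
        a, b = self.params[i]
        return ((a * key + b) % self._PRIME) % self.w

    def update(self, key, value=1):
        # Add the update value to one counter in every row.
        for i in range(self.d):
            self.rows[i][self._index(i, key)] += value

    def query(self, key):
        # The estimate is the minimum over the key's counters.
        return min(self.rows[i][self._index(i, key)] for i in range(self.d))


cms = CountMinSketch(d=4, w=1024)
# (source IP as an integer, packet size in bytes)
for src_ip, size in [(0x0A000001, 1500), (0x0A000002, 64), (0x0A000001, 40)]:
    cms.update(src_ip, size)
estimate = cms.query(0x0A000001)
```

Because counters are shared between keys and all updates here are non-negative, the estimate can exceed the true byte count of 1540 for the first source but can never fall below it.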
Network traffic can be viewed as a data stream in which packets arrive sequentially at the network interface [26]. In network traffic measurement, the sketch data structure is used to summarize the network traffic by processing all packets, without sampling, in a fixed amount of memory. The sketch is updated using a predefined flow key, which is a field or a combination of fields from the packet header. The update value can be the packet size, packet count, or any other appropriate value. For example, in heavy hitter detection, the sketch uses the source IP address as its key and the packet count or packet size as its update value. In heavy change detection [9], the sketch is updated using the source IP address and the packet size. In superspreader detection [12], [29], [30], the update process sets a bit in the bitmap sketch based on the source IP address and the source-destination IP address pair. Subsequently, the sketch can be queried or processed by monitoring applications to obtain accurate statistics of interest from the network traffic, such as the total traffic volume from a source IP address, the total packet count sent to a destination IP address, or other flow-level metrics. It is worth mentioning that, because the sketch data structure does not store flow keys, reversible hashing, a reversible sketch [31], or other techniques must be used to obtain the keys for querying the sketch.
C. RELATED WORK
As summarized in Table 1, existing approaches to overcoming the inflexibility of sketch-based traffic monitoring include UnivMon [32], [33], OpenSketch [22], and the "minimalist" approach to flow monitoring [23]. Sekar et al. proposed the "minimalist" approach [23], in which the flow-level data are collected by using a combination of flow sampling (FS), sample-and-hold (SH), and coordinated sampling (cSamp) methods. The goal of the "minimalist" approach is to minimize the complexity and resource requirements for flow monitoring on a network device through the application of these simple methods. The flow counters stored in SRAM are divided between FS and SH to minimize SRAM usage. Unlike application-specific approaches, in which each monitoring application has its own counters, in the "minimalist" approach the counters can be used by all monitoring applications. Based on trace-driven evaluation, the "minimalist" approach can achieve a level of accuracy similar to that of application-specific approaches using the same memory and computing resources. However, the authors of [23] did not implement the "minimalist" approach in hardware, and only provided assumptions and justifications regarding the feasibility of hardware implementation, processing requirements, and memory consumption.
In line with Sekar et al. [23], Liu et al. [32], [33] developed a universal monitoring architecture (UnivMon) in which a simple, generic monitoring technique runs on the network device. Results can therefore be obtained for a broad spectrum of monitoring tasks with an expected degree of accuracy equivalent to, or better than, that of approaches using custom sketches (i.e., the application-specific approach in [23]) for each monitoring task. The UnivMon data plane utilizes sampling and parallel instances of Count sketches [19] to collect flow-level data, while the control plane sends manifests to the network devices (data plane) to define the monitoring responsibility of each device (e.g., the number of sketch instances and the flow keys to be used) in the flow-level data collection process. After each observation interval, the control plane collects all of the sketches from the devices (data plane), and monitoring applications can then run estimation algorithms on them to estimate the traffic flow metrics of interest.
OpenSketch [22] uses a hash-based measurement data plane and sketch data structure to collect flow-level data. The goal of OpenSketch is to provide a generic and efficient means of measuring network traffic by separating the measurement control plane functions from the data plane in software-defined networking (SDN) context. OpenSketch supports customized flow-level data collection in which flows to be collected and measured are selected using TCAM-based classification. After each observation interval, the sketch counters that are stored in SRAM are sent to the controller (the control plane, in which the measurement libraries and monitoring applications reside) for further analysis. A related system is FlowRadar [34] , which, although not directly targeted to the problem of sketch inflexibility, can serve as a generic flow monitoring tool for data centers. FlowRadar is designed to monitor all flows without sampling in short time scales (e.g., 10 ms). In FlowRadar, per-flow counters are encoded to achieve small memory usage and a constant flow insertion time. These encoded flow counters are then sent to remote collectors for decoding and analysis. The flow table in FlowRadar is based on an extension of the invertible Bloom lookup table (IBLT) [35] , which can maintain flow keys and counters as key-value pairs.
FlexSketchMon has a data plane design and a mechanism for maintaining flows and their counters that differ from those used by UnivMon and OpenSketch. The data plane of FlexSketchMon does not use sampling and sketch instances to collect flows. In contrast, the UnivMon data plane uses sampling and instantiates several Count sketch instances according to the number of traffic substreams, which makes it increasingly inflexible as the number of traffic substreams grows. UnivMon only provides limited flexibility to the monitoring applications. In its current implementation, UnivMon maintains independent sketches for various measurement tasks of interest and uses TCAM, which is limited in size, to store the flow keys. Unlike OpenSketch, FlexSketchMon uses two tables in the data plane to collect flow-level data. In OpenSketch, the flow counter table design in SRAM is divided into a list of logical tables because different sketches require different numbers and sizes of counters. Therefore, the data plane of OpenSketch actually maintains counters for many sketch instances. This approach makes the counter access mechanism (addressing) more complex and requires indexing based on hashing and classification modules. Furthermore, each entry in the table contains only counters without a flow key and monitoring applications that need to identify a specific flow key must use techniques such as reversible hashing or reversible sketch [31] to obtain the flow key, which reduces the accuracy. By using TCAM-based classification prior to the counter table update process in SRAM, OpenSketch limits the flows to be measured.
FlexSketchMon is similar to FlowRadar in that it can be used to monitor all flows on short time scales, although it focuses on providing flexibility for sketch-based algorithms. However, the mechanism for maintaining the flow keys and counters in FlexSketchMon differs from that in FlowRadar. Instead of embracing hash collisions in the flow table update process, as FlowRadar does, FlexSketchMon resolves collisions using a sophisticated hash function (the 5-universal hash function) and linear probing. Additionally, unlike FlowRadar, FlexSketchMon does not encode and decode flow counters, because these processes can generate inaccurate counter values and thereby affect the accuracy of measurements.
In summary, the features that make FlexSketchMon competitive with, or superior to, the other approaches listed in Table 1 are as follows:
• The data plane of FlexSketchMon is generic; in other words, it does not use sampling or maintain independent sketches for each measurement task of interest. Instead, the data plane collects all flows without sampling. The flow counter table stores the exact packet and byte counts; the flow key table stores almost all (more than 99%) of the 5-tuple keys during each observation interval. The flow-level data collected in both tables can be used by either sketch-based network monitoring applications or other monitoring applications that require the flow key and counter data.
• A collision resolution technique is implemented to handle hash collisions in the flow counter table update process; this ensures that all flows in an observation interval are collected.
• As the data in the flow counter and flow key tables are not encoded, no decoding process is needed to recover the original flow keys and counter values. Therefore, there is no additional overhead for encoding and decoding.
Unlike FlexSketchMon, the minimalist approach uses sampling; UnivMon employs sampling and sketching, and maintains independent sketches for various measurement tasks in the data plane; OpenSketch uses sketching in its data plane, relies on TCAM-based classification to select specific flows to be measured, and does not incorporate a collision resolution technique; FlowRadar utilizes counter encoding/decoding and does not handle hash collisions. Additionally, FlexSketchMon accelerates sketch-based traffic measurements by aggregating flows at 5-tuple granularity. Using the flow-level data that are accumulated in the flow counter and flow key tables, the sketch update operation on the monitoring application side can be performed faster because the operation is executed in terms of flows instead of packets.
III. FLEXSKETCHMON ARCHITECTURE
A. OVERVIEW
The architecture of FlexSketchMon is shown in Figure 3 . The top part is implemented in software on the host CPU, while the bottom part is implemented in FPGA hardware. The data plane consists of two main tables: the flow counter table stores the flow ID, packet count, and byte count; and the flow key table stores the 5-tuple flow keys. When a packet arrives, its source and destination IPs, source and destination ports, and protocol (5-tuple) values are extracted as the flow key by the header parser module and hashed by the hash function module to obtain the flow ID. The flow counter must be updated in a fixed number of memory cycles for line-rate packet processing. To meet this requirement, a hash table with linear probing for collision resolution is used to implement the flow counter table. Linear probing is used because of its simplicity of implementation (it supports very fast lookup and insertion using the four-word burst operations of the QDRII+ SRAM). The 5-universal hashing method is used to implement the hash function. Recent studies have shown that a 5-universal hash function can be used to implement linear probing with robust performance and deterministic lookup and insertion times. As the number of probes per update is a constant value of no greater than four [36] , [37] , the flow counter table can provide a timely counter update.
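To make the role of the hash function concrete, the following Python sketch maps a flow ID to a bucket index. It deliberately swaps in a different, textbook construction: a random degree-4 polynomial over a prime field, which gives 5-wise independence before the final reduction to the bucket range. It is not the tabulation-based scheme implemented in the FlexSketchMon hardware. The names `make_poly5_hash` and `MERSENNE_P`, the seed, and the assumption that flow IDs are non-negative integers smaller than the modulus are all illustrative.

```python
import random

MERSENNE_P = (1 << 61) - 1  # prime modulus; flow IDs are assumed to be smaller


def make_poly5_hash(num_buckets, seed=0):
    """Return h(x) = (c4*x^4 + c3*x^3 + c2*x^2 + c1*x + c0 mod p) mod num_buckets."""
    rng = random.Random(seed)
    coeffs = [rng.randrange(MERSENNE_P) for _ in range(5)]  # five random coefficients

    def h(flow_id):
        acc = 0
        for c in coeffs:  # Horner's rule evaluation of the degree-4 polynomial
            acc = (acc * flow_id + c) % MERSENNE_P
        # The final reduction makes the distribution over buckets only
        # approximately uniform when num_buckets does not divide p.
        return acc % num_buckets

    return h


bucket_of = make_poly5_hash(num_buckets=1 << 19, seed=1)
bucket = bucket_of(0xDEADBEEF)
```

A polynomial hash needs several wide multiplications per key, which is one reason table-lookup (tabulation) schemes are attractive in hardware.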
A Bloom filter is utilized to check the presence of each flow's 5-tuple in the flow key table to avoid the storage of duplicated 5-tuple data. If a flow for the current observation interval is not present in the Bloom filter, the 5-tuple flow key is added to the flow key table. The Bloom filter is chosen because of its space efficiency in determining set membership (i.e., checking whether an item is a member of a set) at the cost of a small rate of false positives. It requires less memory than a hash set or hash table. The flow key table is constructed as a FIFO structure in the DDR3 DRAM. The updates to the flow counter table in SRAM and to the flow key table in DRAM are performed in parallel. Therefore, while the flow's packet and byte counters in SRAM are updated, its 5-tuple key is also checked against the Bloom filter and stored in the flow key table in DRAM. Note that whether the 5-tuple key is written to the flow key table in DRAM depends on the result returned by the Bloom filter. At the end of each observation interval, the flow-level data that have been collected in the two tables are moved to the host CPU for utilization as required by the monitoring applications that implement sketch-based algorithms in their measurement tasks.

Algorithm 1: Algorithm for the Flow Counter Table Implementation
  Initialize all entries in the flow counter table CT to 0
  for each packet pkt in the trace file do
    ...
    /* 1st slot in the bucket is empty */
    Update the packet counter of f
    Update the byte counter of f
    else
    ...
    Update the packet counter of f
    Update the byte counter of f
    ...
  end for
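The duplicate-suppression step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the hardware design: the `BloomFilter` class, the choice of three hash positions derived from salted BLAKE2b digests, and the use of a Python list as the FIFO flow key table are all hypothetical, and the filter is keyed here on the encoded 5-tuple.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter over byte strings."""

    def __init__(self, num_bits, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted BLAKE2b digests.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(item, digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add_if_absent(self, item):
        """Set the item's bits; return True if the filter reported it absent."""
        absent = False
        for pos in self._positions(item):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                absent = True
                self.bits[byte] |= mask
        return absent


def record_flow_keys(five_tuples, bloom):
    """Append each 5-tuple to the key table the first time the filter misses."""
    key_table = []  # stands in for the FIFO flow key table
    for five_tuple in five_tuples:
        if bloom.add_if_absent(repr(five_tuple).encode()):
            key_table.append(five_tuple)
    return key_table


flow_a = ("10.0.0.1", "10.0.0.9", 1234, 80, 6)
flow_b = ("10.0.0.2", "10.0.0.9", 5678, 443, 6)
keys = record_flow_keys([flow_a, flow_b, flow_a], BloomFilter(num_bits=1 << 20))
```

A false positive makes the filter report a new flow as already present, so that flow's key is skipped; this is why the key table can hold fewer flows than the counter table, and why a larger filter records more keys.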
In using flow-level data collected in the data plane, monitoring applications read the flow keys from the flow key table and look for their corresponding packet/byte counts from the flow counter table based on the flow IDs. Accordingly, the size of the Bloom filter should be chosen appropriately, as the filter size influences the number of flow keys stored in the flow key table, which in turn affects the measurement accuracy. The monitoring application can obtain the flow IDs by hashing the 5-tuples that are read from the flow key table using the same hash function used by the hardware. The flow ID is then hashed to obtain the counters in the flow counter table (also using the same 5-universal hash function used by the hardware). This process is shown in Figure 4 . In our hardware implementation, we also store the flow IDs in the flow key table along with the 5-tuple flow keys, allowing the monitoring applications to obtain flow ID directly from the flow key table. Consequently, the monitoring applications do not need to hash the 5-tuple to get the flow ID.
The flow counter table stores the packet and byte counts of all flows. However, the flow counter table might not be fully utilized by the monitoring application if the corresponding flow keys are not recorded in the flow key table. This occurrence depends on the applications' measurement tasks. For example, a monitoring application that measures the entropy of a flow key will use data from both tables because it requires the flow key and its counters to calculate the entropy. Conversely, an application that detects and identifies superspreaders uses only the source and destination IP address data from the flow key table.
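As an illustration of an application that needs only the flow key table, the following Python sketch counts the distinct destination IP addresses contacted by each source. It is a simplified stand-in: exact set-based counting replaces the bitmap sketches used in the superspreader literature, and the function name `find_superspreaders`, the tuple layout, and the threshold value are assumptions.

```python
from collections import defaultdict


def find_superspreaders(flow_keys, threshold):
    """Return {src_ip: fan-out} for sources contacting >= threshold distinct destinations."""
    fanout = defaultdict(set)
    # Each key is assumed to be (src_ip, dst_ip, src_port, dst_port, protocol).
    for src_ip, dst_ip, _sport, _dport, _proto in flow_keys:
        fanout[src_ip].add(dst_ip)
    return {src: len(dsts) for src, dsts in fanout.items() if len(dsts) >= threshold}


flow_keys = [
    ("10.0.0.1", "192.168.1.1", 4000, 80, 6),
    ("10.0.0.1", "192.168.1.2", 4001, 80, 6),
    ("10.0.0.1", "192.168.1.3", 4002, 80, 6),
    ("10.0.0.1", "192.168.1.3", 4003, 443, 6),  # same destination, different ports
    ("10.0.0.2", "192.168.1.1", 5000, 22, 6),
]
spreaders = find_superspreaders(flow_keys, threshold=3)
```

No packet or byte counters are consulted, which matches the observation that this task uses only the source and destination addresses from the flow key table.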
Consider the case of using the source IP address for entropy estimation. The monitoring application first reads the source IP addresses from the flow key table. It then looks for the packet count for each source IP address in the flow counter table using the flow ID as the lookup key. After obtaining the packet counts, the monitoring application can process the source IP address and calculate the entropy.
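The source IP entropy computation described above can be sketched in Python as follows. The data layout is an assumption made for illustration: the flow key table is modeled as a list of (flow ID, 5-tuple) pairs and the flow counter table as a dictionary from flow ID to (packet count, byte count). The function name `source_ip_entropy` is hypothetical, and the exact Shannon entropy is computed here rather than a sketch-based estimate.

```python
import math
from collections import Counter


def source_ip_entropy(flow_key_table, flow_counter_table):
    """Shannon entropy (in bits) of the packet distribution over source IPs."""
    per_source = Counter()
    for flow_id, (src_ip, *_rest) in flow_key_table:
        packets, _byte_count = flow_counter_table[flow_id]  # lookup by flow ID
        per_source[src_ip] += packets  # aggregate 5-tuple flows per source IP
    total = sum(per_source.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in per_source.values())


flow_key_table = [
    (1, ("10.0.0.1", "10.0.0.9", 1111, 80, 6)),
    (2, ("10.0.0.1", "10.0.0.8", 2222, 80, 6)),
    (3, ("10.0.0.2", "10.0.0.9", 3333, 53, 17)),
]
flow_counter_table = {1: (30, 45000), 2: (20, 30000), 3: (50, 4000)}
entropy = source_ip_entropy(flow_key_table, flow_counter_table)
```

In this example both sources account for 50 packets each, so the distribution is uniform over two sources. Because the loop runs once per flow rather than once per packet, the aggregation cost scales with the number of flows in the interval.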
B. SIMULATION AND RESULTS
Before implementing the FlexSketchMon data plane in hardware, we simulate it in software with the primary goal of evaluating the data plane design. The simulation is used to evaluate the flow counter table update operation, including the probe length when a flow is inserted into the flow counter table and the table's load factor during an observation interval. The impact of the Bloom filter size on the number of 5-tuple flow keys that are successfully recorded in the flow key table is also observed. Algorithms for the implementations of the flow counter and flow key tables are presented in Algorithm 1 and Algorithm 2, respectively. In the simulation, the size of the Bloom filter is varied from 128 KB to 1 MB ($2^{20}$ to $2^{23}$ bits), and the flow key table can store up to 512K ($2^{19}$) flow keys. The number of buckets in the flow counter table is set to 512K ($2^{19}$), and each bucket can store up to four flow entries, following the four-word burst architecture of the QDRII+ SRAM on the NetFPGA-SUME platform. Note that in this paper, we use 1K and 1k to indicate $2^{10}$ and $10^3$ elements, respectively.
The insertion process of the flow counter table is shown in Figure 5. Each flow ID, x, is hashed by a 5-universal hash function to obtain its bucket in the table. Each bucket can accommodate up to four flow entries and, if the first slot in the bucket is empty, the flow is stored there. When a collision occurs in a slot of a bucket, the insertion algorithm probes the next slot to find an empty place for the flow. Because the first three slots in the bucket (shown in Figure 5) are already occupied, flow ID x is inserted into the fourth slot in the bucket. If no empty slots are available in the bucket, the insertion fails, and the flow must be stored elsewhere, for example in an overflow list (stash), which is typically implemented in on-chip memory [38]. In the hardware implementation (described in the next section), the four-word burst of the QDRII+ SRAM allows all of the slots in a bucket to be accessed in one memory read/write operation; accordingly, the probing process to find an empty slot for a flow in its bucket can be performed in one memory access.

Algorithm 2 (flow key table), final steps:
    ...
    Add f and T to KT
  end if
end for
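The insertion procedure can be sketched in Python as follows. This is a minimal software model written from the prose description, not a reconstruction of Algorithm 1: the class name `FlowCounterTable`, the list-based slots, the dictionary used as the stash, and the `lookup` helper are assumptions. The hash function is passed in as a parameter so that any bucket-index function can be used.

```python
SLOTS_PER_BUCKET = 4  # mirrors the four-word burst of the QDRII+ SRAM


class FlowCounterTable:
    """Buckets of four slots, probed left to right, with an overflow stash."""

    def __init__(self, num_buckets, hash_fn, stash_capacity=2048):
        self.buckets = [[None] * SLOTS_PER_BUCKET for _ in range(num_buckets)]
        self.hash_fn = hash_fn  # maps a flow ID to a bucket index
        self.stash = {}         # flow_id -> [packet_count, byte_count]
        self.stash_capacity = stash_capacity

    def update(self, flow_id, pkt_len):
        """Count one packet for flow_id; return False only if the flow is dropped."""
        bucket = self.buckets[self.hash_fn(flow_id)]
        for slot, entry in enumerate(bucket):
            if entry is None:  # first empty slot: insert a new flow entry
                bucket[slot] = [flow_id, 1, pkt_len]
                return True
            if entry[0] == flow_id:  # existing flow: update both counters
                entry[1] += 1
                entry[2] += pkt_len
                return True
        # All four slots hold other flows: fall back to the stash.
        if flow_id in self.stash:
            self.stash[flow_id][0] += 1
            self.stash[flow_id][1] += pkt_len
            return True
        if len(self.stash) < self.stash_capacity:
            self.stash[flow_id] = [1, pkt_len]
            return True
        return False

    def lookup(self, flow_id):
        for entry in self.buckets[self.hash_fn(flow_id)]:
            if entry is not None and entry[0] == flow_id:
                return (entry[1], entry[2])
        counts = self.stash.get(flow_id)
        return tuple(counts) if counts else None


# Force every flow into one bucket to exercise the overflow path.
table = FlowCounterTable(num_buckets=1, hash_fn=lambda fid: 0, stash_capacity=1)
for fid in (1, 2, 3, 4, 5):
    table.update(fid, 100)
table.update(1, 50)
dropped = not table.update(6, 10)  # bucket and stash are both full
```

Entries are never removed during an observation interval, so slots fill from left to right and the first empty slot proves that the flow is not stored further along in the bucket.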
Trace-driven simulations using a CAIDA trace [39] are conducted to observe the table's load factor. The CAIDA trace, comprising 30.8 million packets and 1.4 million distinct 5-tuple flows with a duration of 60 s, was collected over a 10 Gbps link of a Tier 1 ISP between San Jose and Los Angeles. The observation interval in the simulation is set to 15 s (i.e., the trace is divided into four observation intervals, Interval 1 to 4), with each interval containing about 7.7 million packets with an average of 476k distinct 5-tuple flows. The experimental results (shown in Figure 6) reveal that, using a 5-universal hash function, most of the flows have a probe length of less than or equal to four. In a 15 s observation interval, 99.6% of the flows met this condition, so most of the flows can be stored in the table and only a small fraction (fewer than 2,048 for each observation interval) must be stored elsewhere. In this case, a stash that can store up to 2K flow entries is needed to accommodate all flows in each observation interval. Typically, in a hashing scheme such as Cuckoo hashing [40] that allows multiple moves, a small stash size (2 to 64 entries) is sufficient. However, if the hashing scheme does not move keys among tables, as in the design presented in this paper, a larger stash must be used. For hardware implementation, moving keys among tables is expensive with respect to memory access time, particularly if the tables are stored in off-chip memory.
Note that a Bloom filter is used to check for the presence of a flow ID before its corresponding 5-tuple flow key is added to the flow key table. As a result of false positives in the Bloom filter, the number of flows in the flow key table will be less than the number of flows stored in the flow counter table. 
IV. FLEXSKETCHMON IMPLEMENTATION
A. DEVELOPMENT PLATFORM
The hardware prototype is implemented on the NetFPGA-SUME platform [25] , which is based on a Xilinx Virtex-7 XC7V690T FPGA. The NetFPGA-SUME card provides four 10 Gb Ethernet SFP+ interfaces, a PCIe Gen3 x8 interface that supports an 8 Gbps/lane throughput, three 9 MB Cypress CY7C25652KV18-500BZC QDRII+ SRAMs, and two Micron MT8KTF51264HZ-1G9 4 GB DDR3 DRAMs. The board is attached to the PCIe slot of a machine that is equipped with a quad-core CPU Intel Core i7-4790 that runs at 3.6 GHz with 8 GB DDR3 DRAM. The FPGA is clocked at a frequency of 200 MHz; and the input clock frequencies to the SRAM and DRAM controllers are 200 MHz and 233 MHz, respectively.
The hardware is developed by adding new hardware modules (IP cores) to the NetFPGA-SUME Reference Network Interface Card (NIC) [41]. Figure 8 shows the newly designed IP cores of FlexSketchMon on the NetFPGA-SUME NIC. The nic_broadcast IP core distributes the incoming packets to three IP cores, including the output_port_lookup IP core, thereby maintaining the original function of the NetFPGA-SUME Reference NIC. The nic_ddr3a IP core provides Bloom filter and flow key table functionality, while the nic_qdr2ac IP core implements the 5-universal hash function and the flow counter table. Details of the nic_ddr3a and nic_qdr2ac IP cores are shown in Figures 9 and 12, respectively. The IP cores in the NetFPGA-SUME platform communicate with each other using the AXI4-Stream standard.
The NetFPGA-SUME platform provides a register infrastructure that enables user-space applications that run on the host to read data from and write data to the hardware over the PCIe interface. The AXI4-Lite standard is used for register access. On the hardware side, the DMA IP core contains a DMA engine based on the RIFFA framework [42] and a PCIe endpoint block. The core employs two channels, one for sending/receiving packets and one for accessing (writing/reading) registers. On the software side, user-space host applications utilize the Linux ioctl system call to communicate with the SUME RIFFA device driver for reading and writing registers. The device driver is responsible for configuring the Base Address Register (BAR) and the other settings required for communication over the PCIe bus.
In the NetFPGA-SUME Reference NIC design, each of the four 10 Gb Ethernet ports has a 64-bit data path connected to an ingress queue that collects incoming packets from the network. The Input arbiter module takes packets from the four ingress queues using a round robin policy and sends them to the next module in the data path. The Input arbiter module outputs 256 bits (32 bytes) of data per clock cycle.
B. FLOW KEY TABLE IMPLEMENTATION
The flow key table that stores 5-tuple data is implemented in DRAM and relies on several designed hardware modules for its operation. The flow key table must store each 5-tuple flow key only once (no duplication) and therefore employs a Bloom filter to avoid duplicated storage of 5-tuple data. A read/write control module is used to manage the read/write access to the table in DRAM. Hash modules are implemented to hash the 5-tuple flow key and the flow ID. The two 4 GB DDR3 DRAMs available in the NetFPGA-SUME platform are referred to as DDR3A and DDR3B, respectively. In our implementation, only DDR3A is utilized. Figure 9 shows the block diagram of the nic_ddr3a IP core, and the modules are described as follows:
1) HEADER PROCESSOR AND TUPLE PREPROCESSOR MODULES
The header processor module receives incoming packets and extracts the source and destination IP addresses, source and destination ports, and protocol values from the header of each packet. These 5-tuple values are sent to the tuple preprocessor module, which constructs the 5-tuple data and then hashes it using the TabChar hash module to obtain the flow IDs. The flow IDs and corresponding 5-tuples are then stored in a FIFO. 
2) TABCHAR HASH MODULE
This module hashes each 104-bit 5-tuple input to obtain a flow ID using a 2-universal tabulation-based hash function (TabChar). TabChar hashing computes hash values using table lookups and XOR operations. Each 104-bit input key x (5-tuple) is divided into thirteen 8-bit characters, $x_0, \ldots, x_{12}$, each of which is used as the index for a table lookup. Thirteen tables $T_0, \ldots, T_{12}$, each containing $2^8$ 32-bit random numbers, are needed to generate the hash value of x, which is calculated as follows:
$$h(x) = T_0[x_0] \oplus T_1[x_1] \oplus \cdots \oplus T_{12}[x_{12}]$$
This hash computation is fast, taking only one clock cycle. A small number of registers are used to store the lookup tables.
3) BLOOM FILTER MODULE
The Bloom filter is implemented using the on-chip Block RAM (BRAM). The size of the Bloom filter is 1 MB ($2^{23}$ bits), arranged as a $2^{18} \times 32$-bit array. Bitwise operations are performed to determine the 18-bit value used as the input address to the BRAM, and a controller manages the read/write process to the BRAM. The input to the Bloom filter module is the flow ID to be checked. If a checked flow ID is not present in the Bloom filter, the membership is updated, and the flow ID and associated 5-tuple are added to the post-check FIFO; otherwise, the controller notifies the FIFO controller and the post-check FIFO modules that the flow ID already exists. Figure 10 shows the block diagram of the Bloom filter module.
The hash module for the Bloom filter contains three 2-universal hash function blocks for hashing the flow ID to the Bloom filter in parallel. The hash value is calculated as follows:
$$h_i(x) = (a_i x + b_i) \bmod p, \quad i = 1, \ldots, k$$
where x is the flow ID, $a_i$ and $b_i$ are constant integer numbers, p is the Mersenne prime number $2^{31} - 1$, and k = 3 is the number of hash functions. A 64-bit × 32-bit multiplier core is used to multiply the flow ID by the constant $a_i$. By using a Mersenne prime number, the modulo operation can be replaced by logic operations (AND and shift) for fast computation and hardware-friendly implementation [43]. The output of the hash function is then AND-ed with $2^{23} - 1$ (i.e., modulo the Bloom filter size) to obtain the bit position to be checked in the Bloom filter. The hash function is implemented in a pipeline fashion, with four clock cycles required to compute the hash value.
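The check-and-update behaviour of the Bloom filter and its Mersenne-prime hashing can be illustrated with a short software model. This is only a sketch under stated assumptions, not the hardware design: the constants a_i and b_i are drawn from a seeded pseudo-random generator (the paper does not list them), the bit array is a Python `bytearray` rather than a $2^{18} \times 32$ BRAM, and the names `mod_mersenne`, `BloomFilter`, and `check_and_insert` are hypothetical.

```python
import random

P = (1 << 31) - 1      # Mersenne prime 2^31 - 1
M_BITS = 1 << 23       # Bloom filter size in bits (1 MB)
K = 3                  # number of hash functions


def mod_mersenne(v: int) -> int:
    """Reduce v modulo 2^31 - 1 using only AND, shift, and add."""
    while v > P:
        # 2^31 is congruent to 1 (mod P), so the high part folds onto the low part.
        v = (v & P) + (v >> 31)
    return 0 if v == P else v


class BloomFilter:
    def __init__(self, seed: int = 1):
        rng = random.Random(seed)
        # Assumed constants (a_i, b_i); the real design uses fixed values.
        self.coeffs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(K)]
        self.bits = bytearray(M_BITS // 8)

    def _positions(self, flow_id: int):
        for a, b in self.coeffs:
            # AND with 2^23 - 1 is the modulo by the filter size.
            yield mod_mersenne(a * flow_id + b) & (M_BITS - 1)

    def check_and_insert(self, flow_id: int) -> bool:
        """Return True if flow_id already appears to be present.

        Any unset bit means the flow ID is new; those bits are then set.
        """
        present = True
        for pos in self._positions(flow_id):
            byte, bit = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & bit:
                present = False
                self.bits[byte] |= bit
        return present
```

A return value of `True` for a flow that was never inserted is a false positive, which is the case that causes a 5-tuple to be missing from the flow key table.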
4) DATA PROCESSOR AND DDR3A READ/WRITE CONTROL MODULES
The data preprocessor module reads the post-check FIFO and prepares the data for the DDR3A Read/Write control module. Because the DDR3 memory controller data interface is 512 bits wide, this module combines the flow IDs and 5-tuples with zero padding to construct a 512-bit word before sending it to the DDR3A Read/Write control module. The DDR3A Read/Write control module contains two asynchronous FIFOs (namely WRITE_AFIFO and READ_AFIFO) and a finite state machine (FSM) to control the read and write processes to the DDR3A controller module, as shown in Figure 11. In the write process, the 512-bit words received from the data preprocessor module are written to WRITE_AFIFO; once they are available in this FIFO, the FSM reads them and sends them to the DDR3A controller module. Along with the data, the write address and write command are also set by the FSM module. The DDR3A controller module then handles the memory write operation.
The read process is triggered by a register read access from the user-space host application. The reading of the register at a pre-specified address triggers an input read signal to the DDR3A Read/Write control module. The FSM then responds by sending the read address, read command, and other signals required for the read process to the DDR3A controller module, which handles the memory read operation. Upon arriving at the DDR3A Read/Write control module, the data read from memory are written to READ_AFIFO. The data postprocessor module then reads the 512-bit words from READ_AFIFO, extracts the flow IDs and 5-tuples, and writes them to the register FIFO. On the software side, the user-space host application monitors the register FIFO for data after issuing a read signal and, if the register FIFO is not empty, uses the register read mechanism to move the flow IDs and 5-tuples from the register FIFO to the host.
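The packing done by the data preprocessor and the unpacking done by the data postprocessor can be sketched in software as follows. The sketch assumes a 32-bit flow ID (the width of the TabChar table entries) and three entries per 512-bit word, as stated in the evaluation; the bit ordering inside the word and the function names are hypothetical, since the paper does not specify the layout.

```python
FLOW_ID_BITS = 32                        # assumed width of the TabChar output
TUPLE_BITS = 104                         # 5-tuple width
ENTRY_BITS = FLOW_ID_BITS + TUPLE_BITS   # 136 bits per entry
ENTRIES_PER_WORD = 3                     # 3 * 136 = 408 bits, rest is zero padding
WORD_BYTES = 512 // 8


def pack_word(entries):
    """Pack up to three (flow_id, five_tuple) integer pairs into 64 bytes."""
    if len(entries) > ENTRIES_PER_WORD:
        raise ValueError("at most three entries fit in one 512-bit word")
    word = 0
    for slot, (flow_id, five_tuple) in enumerate(entries):
        entry = (flow_id << TUPLE_BITS) | five_tuple
        word |= entry << (slot * ENTRY_BITS)
    # Unused high-order bits stay zero, which models the zero padding.
    return word.to_bytes(WORD_BYTES, "little")


def unpack_word(raw, count=ENTRIES_PER_WORD):
    """Recover `count` (flow_id, five_tuple) pairs from a 64-byte word."""
    word = int.from_bytes(raw, "little")
    entries = []
    for slot in range(count):
        entry = (word >> (slot * ENTRY_BITS)) & ((1 << ENTRY_BITS) - 1)
        entries.append((entry >> TUPLE_BITS, entry & ((1 << TUPLE_BITS) - 1)))
    return entries
```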
5) DDR3A CONTROLLER MODULE AND FLOW KEY TABLE
The Micron MT8KTF51264HZ-1G9 4 GB DDR3 DRAM is a single-rank DRAM that contains eight 512 MB memory chips, each organized into eight banks with 65,536 rows apiece. Each row has 1,024 columns, with each column storing eight bits of data. The DDR3A controller module provides a user interface (UI) block that enables the DDR3A Read/Write control module to access the DDR3 memory. To carry out a memory write operation, the UI receives the write address, write command, and 512-bit data from the DDR3A Read/Write control module. In a memory read operation, the UI receives the read address and command and then outputs the data read from memory along with a read valid signal. As the memory has a 64-bit interface, each 512-bit word can be sent to or retrieved from the memory in a single write or read operation with a burst length of eight (BL8). In our implementation, the data are stored sequentially from bank 0, row 0, columns 0 to 1,023, then from bank 0, row 1, columns 0 to 1,023, and so on. Up to 24 million flow IDs and 5-tuple keys can be stored in the 4 GB DDR3 DRAM.
C. FLOW COUNTER TABLE IMPLEMENTATION
The three QDRII+ SRAMs in the NetFPGA-SUME platform are referred to as QDR2A, QDR2B, and QDR2C in the design. As the QDR2A and QDR2B share their FPGA banks, the three QDRII+ SRAM controllers cannot be used together because the Xilinx Memory Interface Generator (MIG) does not support bank sharing. Thus, when the MIG-generated controller is used, only two SRAMs can be used together (either QDR2A and QDR2C or QDR2B and QDR2C). Figure 12 shows the block diagram of the nic_qdr2ac IP core and the modules are described as follows: 
1) HEADER PARSER AND TUPLE PREPROCESSOR MODULES
The header parser module extracts the 104-bit 5-tuple and 16-bit packet length values from each packet and sends them to the tuple preprocessor module. As in the nic_ddr3a IP core, the 5-tuple data are constructed and hashed by the TabChar hash module to obtain the flow ID, which is then stored in a FIFO along with the packet length value. The next module reads the flow ID from the FIFO and hashes it to the flow counter table in SRAM using the 5-universal hash function.
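The TabChar step shared by both IP cores can be modelled in a few lines. This is an illustrative sketch only: the table contents come from a seeded pseudo-random generator, and the field order used to build the 104-bit key in `five_tuple_key` is an assumption, as the paper does not state how the 5-tuple fields are concatenated.

```python
import random

NUM_CHARS = 13   # a 104-bit key is thirteen 8-bit characters


def make_tables(seed=0):
    """Thirteen tables of 256 random 32-bit values (assumed contents)."""
    rng = random.Random(seed)
    return [[rng.getrandbits(32) for _ in range(256)] for _ in range(NUM_CHARS)]


def five_tuple_key(src_ip, dst_ip, src_port, dst_port, proto):
    """Concatenate the 5-tuple into a 104-bit integer (hypothetical order)."""
    return (src_ip << 72) | (dst_ip << 40) | (src_port << 24) | (dst_port << 8) | proto


def tabchar_hash(key, tables):
    """XOR one table entry per 8-bit character of the key."""
    h = 0
    for i in range(NUM_CHARS):
        h ^= tables[i][(key >> (8 * i)) & 0xFF]
    return h
```

In hardware the thirteen lookups and the XOR tree are evaluated in parallel, which is why a single clock cycle suffices; the loop above is sequential only because it is software.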
2) 5-UNIVERSAL HASH MODULE
The 5-universal hash function can be computed using either the polynomial or tabulation method [43]. In the polynomial method, it is computed as follows:
$$h(x) = \left( \sum_{i=0}^{4} a_i x^i \right) \bmod p$$
where x is the flow ID, $a_0, \ldots, a_4$ are constant coefficients, and p is a Mersenne prime number. The polynomial hashing is performed in a pipeline fashion with Xilinx embedded multipliers. The modulo operation is replaced by bitwise AND and SHIFT operations to simplify the modulo circuit implementation. The polynomial hashing computation takes 14 clock cycles. Using the tabulation method, the hash value is calculated using bitwise XOR operations over precomputed values that are stored in lookup tables, enabling computation of the hash value in only one clock cycle. The hash value of x is computed as follows:
$$h(x) = T_0[x_0] \oplus T_1[x_1] \oplus T_2[x_2] \oplus T_3[x_3] \oplus T_4[y_0] \oplus T_5[y_1] \oplus T_6[y_2]$$
Seven tables are required to implement the tabulation-based 5-universal hash function for 32-bit keys using 8-bit characters. The indexes for table lookup are obtained from the input x (divided into four 8-bit characters, $x_0, \ldots, x_3$) and three other 8-bit characters, $y_0, \ldots, y_2$, that are computed from the input x and a prime field. In the hardware implementation, the tabulation hash function is constructed using registers to store the precomputed values, multiplexers, and six XOR gates. The 32-bit output hash value is then AND-ed with $2^{19} - 1$ (i.e., mod $2^{19}$) to obtain the SRAM address. Simple tabulation hashing [44] is another hash function that can be used for linear probing in provably deterministic lookup and insertion times. We have also implemented the simple tabulation hash function, which also takes only one clock cycle to compute the hash value. Using 8-bit characters as above, four lookup tables and three XOR gates are required to implement the simple tabulation hash function.
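For readers who want to experiment with the polynomial variant, the following sketch evaluates a degree-4 polynomial with Horner's rule and reduces it with shift and AND operations. The choice of $p = 2^{61} - 1$ and the example coefficients are assumptions (the paper only states that p is a Mersenne prime), and the tabulation-based variant with its derived characters is not shown.

```python
P61 = (1 << 61) - 1          # assumed Mersenne prime; not specified in the text
ADDR_MASK = (1 << 19) - 1    # 2^19 SRAM addresses


def mod_p61(v: int) -> int:
    """Reduce v modulo 2^61 - 1 with AND, shift, and add only."""
    while v > P61:
        v = (v & P61) + (v >> 61)
    return 0 if v == P61 else v


def poly_hash(x: int, coeffs) -> int:
    """Evaluate (a0 + a1*x + a2*x^2 + a3*x^3 + a4*x^4) mod p.

    coeffs is [a0, a1, a2, a3, a4]; five independent random coefficients
    give a 5-universal family.
    """
    h = 0
    for a in reversed(coeffs):   # Horner's rule, highest degree first
        h = mod_p61(h * x + a)
    return h


def sram_address(x: int, coeffs) -> int:
    """Map a flow ID to one of the 2^19 flow counter table addresses."""
    return poly_hash(x, coeffs) & ADDR_MASK
```

Each Horner step needs one multiplication and one reduction, which is consistent with the multi-cycle pipelined implementation described above, whereas the tabulation method trades multipliers for lookup tables.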
3) QDR2AC READ/WRITE CONTROL MODULE
The QDR2AC Read/Write control module, shown in Figure 13 , manages the memory read/write process and communication with the QDRII+ SRAM controllers. This module also handles the register read/write process from the user-space application that runs on the host, which can issue register read signals to read the flow counter table data over the PCIe interface. As in the nic_ddr3a IP core, the QDR2AC Read/Write control module contains two asynchronous FIFOs and a read/write controller, which is implemented as an FSM. In the READ state, a memory read is issued to obtain the data of the current input flow ID from memory; in the WRITE state, the packet and byte counts are updated and written back to memory.
Updating the counters in the flow counter table requires read, modify, and write operations for each incoming packet. Figure 14 shows the data path of these operations. The operations are executed in a three-stage pipeline: in the memory read stage, the counter value at the address given by the 5-universal hash result is retrieved from the table; in the second stage, the counter value returned from memory is modified; and, finally, in the memory write stage the updated counter value is written back to the memory. To avoid read after write (RAW) hazards that can occur during the counter update process, a circuit to handle RAW hazards adopted from [46] can be used. A RAW hazard occurs if an input ID enters the pipeline before the previous update of the same input ID has been completed. The circuit compares the current and previous input IDs and addresses (hash values) and then updates the counters accordingly to ensure that the value written to memory is correct. Data are forwarded from the memory write stage to the computation stage along with the data from the memory read stage.
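The effect of a RAW hazard, and of forwarding in-flight results, can be reproduced with a simplified cycle-level model. This is not the circuit from [46]; it is a hypothetical software abstraction in which one packet enters per cycle, a read issued in cycle t is written back in cycle t + 2, and forwarding simply picks the newest in-flight value for the same address.

```python
from collections import defaultdict, deque


def simulate_updates(addresses, forwarding):
    """Increment one packet counter per address through a 3-stage pipeline.

    Returns the final memory contents as {address: count}.
    """
    mem = defaultdict(int)
    inflight = deque()   # (write_cycle, address, new_value), oldest first
    for cycle, addr in enumerate(addresses):
        # Commit every write whose write-back cycle has been reached.
        while inflight and inflight[0][0] <= cycle:
            _, a, v = inflight.popleft()
            mem[a] = v
        value = mem[addr]                 # memory read stage
        if forwarding:
            # Forward the newest pending value for the same address.
            for _, a, v in inflight:
                if a == addr:
                    value = v
        inflight.append((cycle + 2, addr, value + 1))   # modify, then write
    for _, a, v in inflight:              # drain the pipeline
        mem[a] = v
    return dict(mem)
```

With four back-to-back packets of the same flow, `simulate_updates([5, 5, 5, 5], forwarding=False)` loses updates because later packets read a stale counter, while the forwarding variant counts all four.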
The user-space host application can obtain the flow counter table data by using a register read to trigger an input read signal to the QDR2AC Read/Write control module. The data from memory are then written to the FIFO within the QDR2AC Read/Write control module. When the data are available in this FIFO, they are read by the Async FIFO reader module, which extracts the ID and packet and byte count. These data are then written to the register FIFO, where they can be read and used by the user-space host application.
4) FLOW COUNTER TABLE
Each 72-bit flow entry, which is stored across two QDRII+ SRAMs, comprises a 22-bit ID, a 24-bit packet counter, and a 26-bit byte counter. The counter width is set based on our observation that, in the trace used in the experiments, the packet and byte counts over a short observation interval (≤ 15 s) are both less than $2^{26}$. The counter width can be easily increased when the other unused SRAM (QDR2B) is accessible by the MIG-generated controller. Figure 15 shows the layout of flow entries in memory. The flow counter table can accommodate up to two million flow entries.
Hash collision is resolved by comparing the input flow ID to multiple entries in a bucket in parallel, as shown in Figure 16 . The mechanism makes use of the four-word burst access of the QDRII+ SRAM. A read burst retrieves four 36-bit words (144 bits) from the memory, which allows a bucket with four IDs to be read in one memory access and compared in parallel to the input ID. If the input ID is a new ID that has not been previously stored in the table, a new flow entry is created and stored in one of four slots in the corresponding bucket if an empty slot is available. If the input ID is already present in the table, only its counters are updated and then written back to memory.
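The bucket organization can be mimicked in software as shown below. The sketch is illustrative: the slots are compared in a loop rather than in parallel, the counters saturate at their maximum value (an assumption, since the paper does not describe overflow handling), and the field order used in `pack_entry` is hypothetical.

```python
ID_BITS, PKT_BITS, BYTE_BITS = 22, 24, 26     # 72-bit flow entry
PKT_MAX = (1 << PKT_BITS) - 1
BYTE_MAX = (1 << BYTE_BITS) - 1
SLOTS_PER_BUCKET = 4                          # one four-word burst per bucket


def pack_entry(flow_id, pkts, nbytes):
    """Pack one entry into 72 bits (assumed order: ID, packets, bytes)."""
    return (flow_id << (PKT_BITS + BYTE_BITS)) | (pkts << BYTE_BITS) | nbytes


class FlowCounterTable:
    def __init__(self, addr_bits=19):
        self.addr_mask = (1 << addr_bits) - 1
        self.buckets = {}    # address -> list of [flow_id, pkts, bytes]

    def update(self, address, flow_id, pkt_len):
        """Count one packet; return False if the bucket has no free slot."""
        bucket = self.buckets.setdefault(address & self.addr_mask, [])
        for entry in bucket:
            if entry[0] == flow_id:            # existing flow: update counters
                entry[1] = min(entry[1] + 1, PKT_MAX)
                entry[2] = min(entry[2] + pkt_len, BYTE_MAX)
                return True
        if len(bucket) < SLOTS_PER_BUCKET:     # new flow: take an empty slot
            bucket.append([flow_id, 1, min(pkt_len, BYTE_MAX)])
            return True
        return False                           # fifth flow in a full bucket

    def lookup(self, address, flow_id):
        for fid, pkts, nbytes in self.buckets.get(address & self.addr_mask, []):
            if fid == flow_id:
                return pkts, nbytes
        return None
```

A `False` return value corresponds to a flow that would need a stash or an additional table in order to be recorded.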
In the current hardware prototype, flows with probe lengths greater than four are not stored in the QDRII+ SRAM. This is based on our experimental results, which show that if a 5-universal hash function is used, most of the flows have a probe length of less than or equal to four. To accommodate more flows and increase the table occupancy, multiple-choice hashing with an overflow list (stash) can be utilized [38]. The stash is used to store flows that cannot be accommodated in the table (i.e., those flows whose probe lengths exceed four in our implementation). The implementation of multiple tables (d ≥ 2) with a stash in hardware requires more logic and memory resources than the single-table implementation presented in this paper because d hash functions, d tables, and a small table for the stash are needed. Multiple tables can be implemented by dividing the SRAM space into, for example, 256K addresses for each table (d = 2) or 128K addresses for each table (d = 4). The stash can be implemented in hardware using a content-addressable memory (CAM). Because designing a high-performance CAM on an FPGA is challenging, the size of the CAM is limited to the storage of only a few flow entries. Based on an assumed requirement to store up to 2K flow entries, the required CAM size is 2K × 72 bits, which is approximately 18 KB in total. A larger stash can be implemented using data structures such as binary search trees in the on-chip memory of the FPGA. The hardware implementation of multiple tables with a stash is left for future work.
D. SYNTHESIS AND IMPLEMENTATION RESULTS
The hardware is synthesized and implemented on the Xilinx Virtex-7 FPGA using the Xilinx Vivado tools. The configurable logic blocks (CLBs) on Virtex-7 XC7V690T FPGA are composed of slices, each containing lookup tables (LUTs), registers, multiplexers, and other logic resources. The FPGA also provides on-chip memory (Block RAM) resources that are used to implement the FIFOs and the Bloom filter. The AXI4-Stream data path in the NetFPGA-SUME design is clocked at a frequency of 160 MHz, while the input clock frequencies to the SRAM and DRAM controllers are 200 MHz and 233 MHz, respectively. The designed hardware modules are targeted to their corresponding clock rates and can fulfill all timing requirements. Table 2 shows the post place-and-route logic resources utilization of the FlexSketchMon hardware.
V. EVALUATION
This section describes the evaluation of the FlexSketchMon data plane. Network monitoring applications are also developed and used to demonstrate that the proposed system can provide accurate measurement results.
A. EXPERIMENTAL SETUP
The data plane is evaluated through both simulation (using Xilinx Vivado simulator) and hardware testing on the NetFPGA-SUME card. In the simulation environment, packets from the trace file are converted to AXI4-Stream format and used as the input to the 10 GbE interface module. All packets are set to the minimum Ethernet frame size of 64 bytes and sent to one of the 10 Gb Ethernet ports. In the hardware test, a traffic generator/replay tool is used to send packets to the 10 GbE interface. The experimental testbed for hardware testing is shown in Figure 17 . Throughput in Gigabits per second (Gbps) and latency measured in units of time are used as performance metrics in the evaluation. The throughput is calculated based on the clock rate and the 64-byte minimum Ethernet frame size.
B. DATA PLANE PERFORMANCE EVALUATION
As shown in Figure 8 , the FlexSketchMon data path does not interrupt the original NIC pipeline, which allows the NIC to maintain a 10 Gbps line rate per port with a total of 40 Gbps for all four Ethernet ports. The NIC can accept a 64-byte minimum Ethernet frame in two clock cycles at 160 MHz clock frequency, resulting in a throughput of 53.76 Gbps.
The performance of the flow key table is evaluated to observe its latency for writing a flow entry to the memory. The flow key processing data path in FlexSketchMon comprises: 1) parsing the 5-tuple, hash computations, querying/updating the Bloom filter, writing/reading the FIFO, combining data for writing to the memory, and asynchronous FIFO writing; followed by 2) asynchronous FIFO reading and memory writing. It is observed during the simulation that, for the second part of the data path, reading the 512-bit data from the asynchronous FIFO to the DDR3 controller takes two clock cycles, while writing to memory takes 19 clock cycles. The total latency is therefore 21 clock cycles. As three flow IDs and 5-tuples are written to the memory in one write operation, the average latency of writing one flow ID and its 5-tuple is seven clock cycles (≈ 30.1 ns), which corresponds to a 22.3 Gbps throughput for updating the flow key table in a single memory write operation. Updating the flow key table is not the critical path in this design because the update process does not occur for each flow key; instead, the flow key table is updated only if the flow key is not already present in the flow key table.
Updating the counters in the flow counter table is the most important process in the data plane because it is executed for each incoming packet. The critical path lies in the hash computation and memory access used to update the counters. Therefore, the flow counter table throughput is determined mainly by the hash computation and the SRAM access latency. The latency of updating one flow entry is one clock cycle (6.25 ns) for the tabulation 5-universal hash function at 160 MHz clock frequency and four clock cycles (8 ns) for the read-write process at 500 MHz SRAM clock frequency. This corresponds to a 47.15 Gbps throughput for updating the flow counter table. It is important to note that this throughput is calculated only for the first packet that enters the hash table's pipeline. If more packets are used to fill the pipeline, each packet can be processed in around 7 ns, which is equivalent to a 96 Gbps throughput for the 64-byte minimum Ethernet frame size. The throughput of the flow counter table can easily exceed 100 Gbps if a newer FPGA, such as the Xilinx Virtex UltraScale+ with on-chip UltraRAM or QDR-IV SRAM is used.
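The throughput figures quoted in this subsection can be reproduced with simple arithmetic. The sketch below assumes that each 64-byte frame occupies 84 bytes (672 bits) on the wire, i.e., the frame plus 8 bytes of preamble and 12 bytes of inter-frame gap; this accounting is inferred from the reported numbers rather than stated in the text.

```python
# Assumption: 64-byte frame + 8-byte preamble + 12-byte inter-frame gap.
WIRE_BITS = (64 + 8 + 12) * 8            # 672 bits per minimum-size frame


def throughput_gbps(latency_ns: float) -> float:
    """Bits per nanosecond equals gigabits per second."""
    return WIRE_BITS / latency_ns


CYCLE_160MHZ_NS = 1e3 / 160              # 6.25 ns
CYCLE_500MHZ_NS = 1e3 / 500              # 2 ns

nic = throughput_gbps(2 * CYCLE_160MHZ_NS)                       # two cycles per frame
first_packet = throughput_gbps(CYCLE_160MHZ_NS + 4 * CYCLE_500MHZ_NS)
pipelined = throughput_gbps(7.0)                                 # about 7 ns per packet
flow_key = throughput_gbps(30.1)                                 # 7 cycles, about 30.1 ns

for name, value in [("NIC", nic), ("first packet", first_packet),
                    ("pipelined", pipelined), ("flow key table", flow_key)]:
    print(f"{name}: {value:.2f} Gbps")
```

Under this assumption the four values come out at roughly 53.76, 47.16, 96, and 22.3 Gbps, in line with the figures reported in the text (which quotes the second value as 47.15 Gbps).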
As mentioned in Section III, the number of entries stored in the flow key table is slightly less than the number of entries in the flow counter table as a result of false positives of the Bloom filter. Based on the assumption that 512K flow IDs and 5-tuples entries (which is equal to the number of addresses in the flow counter table) are stored in the flow key table per observation interval, the total data size is 8.5 MB. In the short observation interval used in our experiments (≤ 15 s), the amount of data in the flow key table is typically less than 8.5 MB and the user-space application is able to read the table straightforwardly using register read access via the PCIe interface. The volume of data transferred is very small relative to the theoretical PCIe Gen3 x8 bandwidth (≈ 8 GB/s). Assuming further that the flow counter table is fully occupied during an observation interval, the size of the data that will be transferred via the PCIe bus is 18 MB, a volume that, once again, is very small relative to the PCIe bus bandwidth. Furthermore, as the observation interval used in the experiments is short, the case in which the flow counter table is fully occupied is very rare, and the data in the table can be retrieved by the user-space applications in less than 10 ms. Because the system does not use a sampling scheme, it is limited by the available SRAM resources when longer observation intervals (> 60 s) are used. Based on the number of flows in the trace used in our experiments, the available SRAM space would be more than sufficient to store all flows within an observation interval of 60 s. In addition, to utilize sketch-based monitoring over longer observation intervals, data from sketches for shorter intervals can be combined because of the linearity property of the sketch data structure.
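The data volumes mentioned above follow directly from the entry sizes. The short calculation below assumes a 32-bit flow ID stored next to each 104-bit 5-tuple and uses the theoretical PCIe bandwidth of about 8 GB/s quoted in the text, so the transfer time is a lower bound that ignores register-access overhead.

```python
MIB = 1024 * 1024
ADDRESSES = 512 * 1024                  # flow counter table addresses

# Flow key table: one (flow ID, 5-tuple) pair per address, 32 + 104 bits each.
key_table_bytes = ADDRESSES * (32 + 104) // 8

# Flow counter table: four 72-bit entries per address when fully occupied.
counter_table_bytes = ADDRESSES * 4 * 72 // 8


def transfer_ms(num_bytes: int, bandwidth_bytes_per_s: float = 8e9) -> float:
    """Idealized transfer time in milliseconds at the given bandwidth."""
    return num_bytes / bandwidth_bytes_per_s * 1e3


print(key_table_bytes / MIB, counter_table_bytes / MIB)
print(transfer_ms(counter_table_bytes))
```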
C. MONITORING APPLICATIONS EVALUATION
Monitoring applications that utilize sketch data structures can exploit the flow-level data collected in the flow counter and flow key tables. They can flexibly select the flow key to be processed based on their measurement task. In this section, entropy estimation, superspreader detection, and heavy hitter detection applications are evaluated. In all experiments, a 1-minute CAIDA trace is used and the observation interval is set to 15 s. The results obtained for these monitoring applications are not empirically compared to results of those related works listed in Table 1 because they employ sampling, 
1) ENTROPY ESTIMATION
Entropy measures the amount of randomness in the distribution of a random variable (e.g., source IP address, destination IP address, etc.). It has been used in network traffic monitoring to detect DoS attacks, port scanning, and anomalous traffic [13], [45]. The entropy is defined as $H = -\sum_{i=1}^{n} p_i \log_2 p_i$, where $p_i$ is the fraction of packets that carry the i-th of the n distinct values, and the normalized entropy is $H_{norm} = H / \log_2 n$. In the following implementation, the entropy is estimated using the contents provided by the flow counter and flow key tables. The entropy of the 5-tuple can be estimated in a straightforward manner because the flows are collected at 5-tuple granularity: the application can read the 5-tuple and packet counter data from the tables and then calculate the entropy. To obtain the entropy of the source IP address, additional processing must be performed, as described in Algorithm 3: the application processes the source IP address data from the flow key table, updates a hash table using the value from the flow counter table, and then calculates the entropy. This is also required when computing the entropy of the destination IP address, source port, or destination port.
The results of the estimation are compared to the exact entropy values, with the accuracy measured in terms of relative error (RE),
$$RE = \frac{|H_{norm} - \hat{H}_{norm}|}{H_{norm}}$$
where $H_{norm}$ and $\hat{H}_{norm}$ are the exact and estimated values, respectively. Figure 18 shows the experimental results of entropy estimation for the 5-tuple, source IP address, and destination IP address using various Bloom filter sizes. As shown, the application can estimate entropy with high accuracy and can therefore be used to accurately detect anomalous network traffic. The estimation accuracy increases with the size of the Bloom filter, with filter sizes larger than 256 KB yielding estimation errors of less than 1%.
Algorithm 3 (excerpt, steps 5 to 13)
5: Read the corresponding packet count from the flow counter table CT
6: Hash s to HT
7: if collision then
8:   Resolve collision using chaining
9:   Update counter of s using the packet count
10: else
11:   Update counter of s using the packet count
12: end if
13: end for
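The per-source aggregation and the entropy calculation can be sketched in software as follows. This is an illustration rather than the authors' implementation: the two tables are modelled as plain dictionaries keyed by flow ID, a Python `dict` replaces the chained hash table HT, and the normalization by $\log_2 n$ is the common convention, assumed here because the text does not spell it out.

```python
import math
from collections import defaultdict


def normalized_entropy(counts):
    """Shannon entropy of the count distribution divided by log2(n)."""
    total = sum(counts)
    n = len(counts)
    if n <= 1 or total == 0:
        return 0.0
    h = -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
    return h / math.log2(n)


def source_ip_entropy(flow_key_table, flow_counter_table):
    """Aggregate 5-tuple packet counts by source IP, then compute entropy.

    flow_key_table:     {flow_id: (src_ip, dst_ip, src_port, dst_port, proto)}
    flow_counter_table: {flow_id: packet_count}
    """
    per_source = defaultdict(int)
    for flow_id, five_tuple in flow_key_table.items():
        src_ip = five_tuple[0]
        # Flows missing from the counter table contribute nothing.
        per_source[src_ip] += flow_counter_table.get(flow_id, 0)
    return normalized_entropy(list(per_source.values()))


def relative_error(exact, estimated):
    return abs(exact - estimated) / exact
```

Selecting a different tuple field (for example index 1 for the destination IP address) gives the entropy of that key without any change to the data plane.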
2) SUPERSPREADER DETECTION
The goal of superspreader detection is to find a list of source IP addresses that are in contact with a large number of destination IPs. The application can use source-destination IP pair data from the flow key table to perform the superspreader detection. In the trace used in our experiments, the number of distinct source IP addresses during each 15 s observation interval is around 192k to 206k. Figure 19 shows the number of sources at each fan-out value (number of outgoing connections) in each interval. The number of sources decreases as the fan-out value increases. At a fan-out value of around 200, there is only one or a small number of sources for each fan-out value. In these experiments, superspreaders are defined as sources whose fan-out exceeds 200. The false negative ratio (FNR) and false positive ratio (FPR) are used to measure the accuracy of superspreader detection, while the fan-out estimation accuracy is evaluated using the average relative error (ARE). Let $n_s$ denote the exact fan-out of a superspreader s and $\hat{n}_s$ its estimated value. The relative error (RE, in percentage) is given by
$$RE_s = \frac{|n_s - \hat{n}_s|}{n_s} \times 100\%$$
Suppose the number of detected superspreaders is N; then the average relative error (ARE) is computed as
$$ARE = \frac{1}{N} \sum_{s=1}^{N} RE_s$$
The FNR is defined as the number of superspreaders that are not identified divided by the number of actual superspreaders, while the FPR is obtained as the number of non-superspreaders that are incorrectly identified as superspreaders divided by the number of actual superspreaders.
To identify superspreaders and count their fan-outs, the application maintains two hash tables: one to remove duplicate source-destination IP pairs, and another to count the fan-out as presented in Algorithm 4. Chaining is used in both tables to resolve hash collisions. This method is conventionally used for exact fan-out counting and is very accurate. Figure 20 shows the fan-out estimation accuracy, in which the proximity of a point to the diagonal line represents greater estimation accuracy. The ARE in the fan-out estimation of all detected superspreaders is 0.11%.
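The two-table approach can be sketched with Python's built-in hash containers. Note that this swaps in a `set` and a `dict`, which resolve collisions internally by open addressing rather than by chaining; the function name and the input format (an iterable of source-destination pairs read from the flow key table) are assumptions made for illustration.

```python
from collections import defaultdict


def detect_superspreaders(pairs, threshold=200):
    """Return {source: fan-out} for sources whose fan-out exceeds threshold.

    pairs: iterable of (src_ip, dst_ip) tuples, possibly with duplicates.
    """
    seen_pairs = set()            # first table: removes duplicate pairs
    fanout = defaultdict(int)     # second table: distinct destinations per source
    for src, dst in pairs:
        if (src, dst) not in seen_pairs:
            seen_pairs.add((src, dst))
            fanout[src] += 1
    return {src: n for src, n in fanout.items() if n > threshold}
```

Because every distinct pair is counted exactly once, the result is exact for the pairs that are present in the flow key table; any residual error stems from flow keys that the Bloom filter wrongly reported as already stored.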
3) HEAVY HITTER DETECTION
Heavy hitters are flows that use more than a given fraction, φ, of the total traffic in an observation interval. At the end of each observation interval, the heavy hitter detection algorithm reports the IDs and sizes of all flows whose sizes exceed a predefined threshold. A heavy hitter can be mathematically defined as follows [46]: Given an input packet stream
$$\sigma = \langle (k_1, v_1), (k_2, v_2), \ldots, (k_N, v_N) \rangle$$
where N is the total number of packets in σ, $k_i$ is the key used to identify the i-th input packet, and $v_i$ is the value (packet count or size) of the i-th input packet; then, for a given threshold φ, 0 < φ ≤ 1, a heavy hitter is a flow identified by key K whose total value satisfies
$$\sum_{i:\, k_i = K} v_i \ge \phi \sum_{i=1}^{N} v_i$$
Such a flow is known as a φ-heavy hitter [20]. The monitoring application can use the source IP addresses, packet counters, and byte counters that are provided in the flow counter and flow key tables to detect heavy hitters. We have developed an application for heavy hitter detection that utilizes the Count-Min (CM) sketch algorithm [20], which uses a two-dimensional array of counters, C, with w columns and d rows. Each row j is associated with a hash function $h_j$. At the beginning of an observation interval, all counters are set to zero. The CM sketch update and query operations are respectively defined as
update: $C[j][h_j(k)] \leftarrow C[j][h_j(k)] + v$ for $j = 1, \ldots, d$,
query: $\hat{v}(k) = \min_{1 \le j \le d} C[j][h_j(k)]$,
where k and v are the key and updated value, respectively.
In our implementation, the source IP address and packet size are chosen as the key and updated value, respectively. During the update process, the source IP address of each packet is hashed to d counters, which are updated using the packet size. Each counter in the sketch is 32 bits wide. At the end of the observation interval, the sketch is queried using a set of source IP addresses that are stored in another hash table. The estimated value of the total bytes of a source IP address is derived as the minimum value of all of its corresponding counters. Finally, all source IP addresses whose estimated values of traffic volume (total bytes) are greater than or equal to $\phi \sum_{i=1}^{N} v_i$ are identified as heavy hitters. The process is shown in Figure 21.
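A compact software version of this procedure is given below as a sketch. The row hash functions are of the form ((a·k + b) mod p) mod w with pseudo-random constants, which is an assumption (the paper does not state which hash functions the application uses), and Python integers are used instead of 32-bit counters.

```python
import random

P = (1 << 31) - 1    # prime modulus for the assumed row hash functions


class CountMinSketch:
    def __init__(self, width, depth=3, seed=7):
        rng = random.Random(seed)
        self.width = width
        self.hashes = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _column(self, j, key):
        a, b = self.hashes[j]
        return ((a * key + b) % P) % self.width

    def update(self, key, value):
        """Add value to one counter in every row."""
        for j, row in enumerate(self.rows):
            row[self._column(j, key)] += value

    def query(self, key):
        """Estimate the total value of key as the minimum over all rows."""
        return min(row[self._column(j, key)] for j, row in enumerate(self.rows))


def heavy_hitters(packets, phi, width, depth=3):
    """packets: iterable of (src_ip, size). Returns {src_ip: estimated bytes}."""
    sketch = CountMinSketch(width, depth)
    candidates = set()          # the separate table of source IP addresses
    total = 0
    for src_ip, size in packets:
        sketch.update(src_ip, size)
        candidates.add(src_ip)
        total += size
    result = {}
    for src_ip in candidates:
        estimate = sketch.query(src_ip)
        if estimate >= phi * total:
            result[src_ip] = estimate
    return result
```

Since every counter only ever grows, the estimate never falls below the true volume, so a true heavy hitter cannot be missed; collisions can only inflate estimates and produce false positives.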
In a 15 s observation interval, the number of distinct source IP addresses in the trace ranges from approximately 192k to 206k. For each interval, the threshold is set such that the heavy hitters are all source IP addresses whose volumes are greater than or equal to 0.5% (φ = 0.005) of the total bytes in the interval. Table 3 lists the threshold (φ × total bytes) and the actual number of heavy hitters in each observation interval. In the experiments, the sketch depth was fixed to d = 3 and the sketch width w was varied from 1,024 to 4,096. False negatives (i.e., unidentified heavy hitters), false positives (i.e., non-heavy hitters that are incorrectly identified as heavy hitters), and the ARE are used as metrics to evaluate the accuracy of heavy hitter detection. The relative error (RE, in percentage) for a heavy hitter h is given by
$$RE_h = \frac{|v_h - \hat{v}_h|}{v_h} \times 100\%$$
where $v_h$ denotes the exact traffic volume (total bytes) of a heavy hitter h and $\hat{v}_h$ is its estimated value that is obtained from the sketch. Suppose the number of identified heavy hitters is N; then the ARE is computed as
$$ARE = \frac{1}{N} \sum_{h=1}^{N} RE_h$$
It is important to note that the data (packet or byte counts) accumulated by the sketch during an observation interval are random and independent of those obtained during other observation intervals. The total number of packets or bytes in an observation interval solely depends on the number of packets or the length of each packet during that interval. Therefore, the number of heavy hitters and the estimated traffic volume of each heavy hitter in an observation interval are independent of the numbers and volumes obtained during other observation intervals. The primary factor affecting the detection accuracy and error in the estimated traffic volume of a heavy hitter is the sketch memory size; for this reason, different sketch memory sizes are used in the experiments. The percentage of false positives in each observation interval at different sketch memory sizes is presented in Figure 22 to show the impact of the sketch memory size on the accuracy of heavy hitter detection. Memory sizes of 12 KB and 48 KB correspond to 3×1,024 and 3×4,096 counters in the sketch, respectively. For each observation interval, the number of false positives decreases to nearly zero as the sketch memory size increases, indicating that the detection result becomes more accurate as the application allocates more memory to the sketch data structure. No false negatives occur at any of the tested sketch memory sizes (12 KB to 48 KB). Figure 23 depicts the ARE in heavy hitter estimated traffic volume in each observation interval at different sketch memory sizes. As shown, the ARE decreases as the application allocates more memory to the sketch data structure. As more counters are allocated for the sketch, the accumulated byte counts of all flows are more distributed; therefore, the accuracy of estimation increases.
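The three accuracy metrics can be computed from the exact and estimated volumes with a few set operations. In this sketch the ARE is averaged over the correctly identified heavy hitters only, which is one possible reading of "identified heavy hitters" and is therefore an assumption; the function name and input format are hypothetical.

```python
def evaluate(exact, estimated, threshold):
    """Compare estimated volumes with exact ones.

    exact, estimated: {key: volume}; threshold: phi times the total volume.
    Returns (false_positives, false_negatives, are_percent).
    """
    true_hh = {k for k, v in exact.items() if v >= threshold}
    reported = {k for k, v in estimated.items() if v >= threshold}
    false_positives = reported - true_hh
    false_negatives = true_hh - reported
    identified = true_hh & reported
    if identified:
        are = sum(abs(exact[h] - estimated[h]) / exact[h] * 100
                  for h in identified) / len(identified)
    else:
        are = 0.0
    return false_positives, false_negatives, are
```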
D. COMPARISON TO OTHER APPROACHES
We mentioned previously that comparisons to the related approaches listed in Table 1 are not entirely fair because of differences in data plane design, as described in Section II C. Again, we stress that FlexSketchMon does not use sampling or sketching in the data plane; instead, it stores the exact packet and byte counts of all flows in the flow counter table during an observation interval. Either sketch-based monitoring applications or other general monitoring applications can use the flow-level data collected in FlexSketchMon's data plane. Nevertheless, we provide here examples of measurement results obtained using UnivMon for entropy estimation and heavy hitter detection and compare them to the results obtained using FlexSketchMon. We utilize the C++ implementation of UnivMon and evaluate it using the same CAIDA trace file that is used in our experiments. Figure 24 shows the results of entropy estimation using the 5-tuple as the flow key. As shown, FlexSketchMon outperforms UnivMon because it uses exact packet counts in computing the entropy. Errors occurring in the FlexSketchMon estimation are caused by the failure to store some flow keys in the flow key table as a result of false positives of the Bloom filter, which results in the exclusion of some counters in the flow counter table when the entropy is calculated.
As the implementation of UnivMon for heavy hitter detection uses the fraction of total packet counts instead of the fraction of total bytes as the threshold, we modify our heavy hitter detector to use the fraction of total packet counts as well. The same settings (φ = 0.005, d = 3, w = 1,024 to 4,096) are used to evaluate both implementations. Table 4 lists the thresholds (φ × total packet counts) and the actual number of heavy hitters in each observation interval obtained using the fraction of total packet counts as the threshold for heavy hitter detection.
The experimental results show that when using the fraction of total packet counts as the threshold, the FlexSketchMon heavy hitter detector generates very accurate results without false positives or negatives. Furthermore, the ARE of the estimated packet counts of the identified heavy hitters is less than 1% in all observation intervals. The heavy hitter detection results obtained from UnivMon are similar to those obtained from FlexSketchMon, with no false positives or negatives. These results are expected as the data structure (Count sketch) used by UnivMon was originally tailored to heavy hitter detection. Nevertheless, FlexSketchMon is better (i.e., has a smaller ARE) at estimating the packet counts of the identified heavy hitters. Superspreader detection presents a special case as it requires only the source IP address and the source-destination IP address pair; the current implementation of UnivMon cannot produce accurate measurement results for superspreader detection because it collects only a limited number of flow keys.
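The evaluation metrics discussed above can be sketched in Python as follows. The threshold follows the φ × total packet counts rule stated in the text; the ARE is assumed here to be the mean of |estimate − truth| / truth over the reported flows, and all function names are hypothetical.

```python
def heavy_hitter_threshold(phi, true_counts):
    """Threshold as a fraction phi of the total packet count."""
    return phi * sum(true_counts.values())


def detection_errors(true_counts, reported, phi):
    """Return (false_positives, false_negatives) as sets of flow keys."""
    threshold = heavy_hitter_threshold(phi, true_counts)
    actual = {k for k, c in true_counts.items() if c >= threshold}
    reported = set(reported)
    return reported - actual, actual - reported


def average_relative_error(true_counts, estimated_counts):
    """Mean relative error over reported flows with a nonzero true count
    (assumed definition of ARE)."""
    keys = [k for k in estimated_counts if true_counts.get(k, 0) > 0]
    if not keys:
        return 0.0
    errors = [
        abs(estimated_counts[k] - true_counts[k]) / true_counts[k]
        for k in keys
    ]
    return sum(errors) / len(errors)
```

With φ = 0.005, a flow is counted as a heavy hitter when its packet count reaches 0.5% of all packets in the observation interval.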
We then compare the heavy hitter detection performance of FlexSketchMon to that of OpenSketch using the available OpenSketch C++ implementation [47]. The same parameter settings (d, w, threshold) as above are used, with the threshold chosen based on the packet count. The ARE in the estimation of packet counts in each observation interval is shown in Figure 25. Although both FlexSketchMon and OpenSketch correctly identify all heavy hitters in each interval, FlexSketchMon outperforms OpenSketch in estimating the packet counts of the identified heavy hitters. The superspreader detection in OpenSketch is based on the algorithm in [48], which uses sampling and is less accurate, resulting in many false positives and false negatives in our experimental results. OpenSketch is easily outperformed by FlexSketchMon, which uses exact data for counting the fan-out of the superspreaders. Based on the above experimental results, FlexSketchMon achieves more accurate measurement results than UnivMon in entropy estimation and heavy hitter detection. It also achieves more accurate measurement results than OpenSketch in superspreader detection and heavy hitter detection. In summary, FlexSketchMon offers comparable or better measurement accuracy than the previous approaches.
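Counting fan-out from exact source-destination pairs, as opposed to sampling, can be illustrated with a short Python sketch. The function names and the rule "more than k distinct destinations" are assumptions made for illustration; the detector evaluated in this paper may apply a different threshold rule.

```python
from collections import defaultdict


def exact_fan_out(src_dst_pairs):
    """Number of distinct destinations contacted by each source."""
    destinations = defaultdict(set)
    for src, dst in src_dst_pairs:
        destinations[src].add(dst)  # duplicates of a pair count once
    return {src: len(dsts) for src, dsts in destinations.items()}


def superspreaders(src_dst_pairs, k):
    """Sources whose fan-out exceeds k (assumed threshold rule)."""
    fan_out = exact_fan_out(src_dst_pairs)
    return {src for src, n in fan_out.items() if n > k}
```

Because every observed pair contributes to the count, this approach has no sampling error; its accuracy depends only on whether all pairs are available to the application.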
VI. CONCLUSION AND FUTURE WORK
This paper presents FlexSketchMon, a system that flexibly supports a variety of sketch-based network traffic measurement tasks. FlexSketchMon also accelerates the sketch-based network measurement process by aggregating flows at 5-tuple granularity before the flow statistics are used by the monitoring applications. The implementations of the flow key table in DDR3 DRAM and the flow counter table in QDRII+ SRAM are presented. By using a 5-universal hash function, the flow counter table can be updated in deterministic time. The system was evaluated in software and hardware settings with a real traffic trace for entropy estimation, superspreader detection, and heavy hitter detection. Our experimental results demonstrate that monitoring applications that utilize sketch-based algorithms can flexibly select the required flow keys and update their sketch data structures without changes (reprogramming or redesigning) to the hardware data plane. The monitoring applications can also execute their tasks and obtain accurate measurements. The entropy estimator application obtains less than 1% estimation error in computing the entropy of a flow key. In superspreader detection, the application identifies the spreaders with very low false positive and false negative rates. In heavy hitter detection, the application detects heavy hitters accurately and estimates their traffic volumes with an estimation error of less than 1%. The system is implemented on the NetFPGA-SUME platform and can process network traffic at a maximum throughput of 96 Gbps in the worst-case scenario of minimum-size 64-byte Ethernet frames.
In future work, we plan to extend the data plane of FlexSketchMon by implementing multiple hash tables with a stash for the flow counter table and incorporating additional DDR3 DRAM for the flow key table. We will also further optimize the hardware design to improve the system's performance. As a contribution to the community, our IP cores will be released to GitHub [49] in the near future.
