In this paper we present an FPGA-based architecture to export flows in 10 Gbps networks, implemented on the NetFPGA-10G platform. 
INTRODUCTION
Network operators routinely use flow-based tools to track bandwidth utilization as well as network malfunctions and attacks. The infrastructure mainly consists of two elements: the flow exporter, which analyzes packets in the network and creates the flows, and the flow collector, which receives flows from exporters and stores them for future processing. Flow exporters are typically implemented inside routers and switches, taking advantage of their resources to analyze packets. Periodically, exporters send finished flows to the collector using a specific protocol such as Cisco's NetFlow, Juniper's Jflow or the IETF standard IPFIX. The main advantage of monitoring a network using routers and switches as flow exporters is that it is not necessary to add any extra component. Unfortunately, in the context of high-speed networks, this approach exhibits the following drawbacks:
• In highly loaded networks, routers and switches can be burdened by too much traffic. When this occurs, flow monitoring is skipped in order to dedicate all computing resources to routing packets, thus causing all flow information to be lost [1].
• In high-speed networks, flow-based monitoring is accomplished by routers and switches on the basis of packet sampling, i.e. not all the packets on the network are analyzed. This can result in poor accuracy of the data delivered to the collector.
• Routers and switches have limited resources, so they cannot scale to higher link rates or larger memories to store more active flows. Moreover, these network devices are closed platforms, so even if they perform flow-based monitoring, network engineers are not free to modify how flows are defined or what type of information is collected.
A dedicated flow exporter may overcome the limitations listed above. In this paper we propose a flexible, open-source platform to generate flow information, which is suitable for 10 Gbps high-speed networks. The advantage of this approach is that it guarantees an accurate result, not based on sampling, regardless of the load on routers and switches. The extra cost of adding a dedicated flow exporter is mitigated by providing an open-source solution, which also has the additional benefit of being customizable to meet any particular network monitoring need. Here, we show how FPGA technology is a perfect match for achieving this goal.
RELATED WORK
There are many network probes that generate network flows at 10 Gbps. Some are software-based and run on commodity servers, while others are based on FPGAs.
Software-based probes: High-speed software implementations require multicore architectures and a careful balance between cores. nProbe [2] and softflowd [3] are two popular open-source approaches. Danelutto et al. [4] report the use of several advanced parallel programming techniques in the implementation of ffProbe and compare it against nProbe and softflowd. They achieve close to 10 Mpps for ffProbe and poorer results for the others.
FPGA probes: Žádník [5] presents a first implementation on NetFPGA-1G (a Virtex-2 platform). An advanced 10 Gbps implementation supporting 256,000 concurrent active flows is presented in FlowMon [6], using a Virtex-5 in the context of the Liberouter project. Yusuf et al. [7] proposed an architecture for network flow analysis using a Virtex-2 device, which is able to store up to 65,536 concurrent flows at a maximum rate below 3 Mpps. Rajeswari et al. [8] give some results for Virtex-5, claiming a superior speed but only for 500 concurrent flows.
FLOW-BASED MONITORING TECHNIQUE AND DEVELOPMENT PLATFORM
In this section we review the key concepts of flow-based analysis and briefly present NetFPGA-10G, the platform on which the designs have been implemented.
Flow Cache and Network Flow as the Traffic Analysis Unit
A flow, according to Cisco's definition, is a unidirectional stream of packets between a given source and destination [9]. Each flow is identified by five key fields (5-tuple henceforth): source IP address, destination IP address, source port number, destination port number, and Layer 3 protocol type. Packets with the same 5-tuple belong to the same flow. A fast local memory inside the exporter, known as the flow cache, stores the active flows of the link being monitored. The data structure in the flow cache is called the flow table, and it consists of a list of flow records, one for each active flow. Apart from the 5-tuple, each flow record contains the number of packets, the total number of transmitted bytes, the timestamps of flow creation and expiration, and the TCP flags. This information is used for later traffic analysis, once the flow is purged from the flow cache. Every time a packet is received, the memory is polled to determine whether the extracted 5-tuple matches an active flow. If it does not, a new flow entry is created; otherwise, the active flow in the flow table is updated. In parallel with flow creation and updates, a separate mechanism removes flow records from the flow table once the flows are no longer present on the link. Flows expire either by timeout or when the end of a TCP connection is detected (signalled by the FIN or RST flags). Two concurrent processes therefore access the flow table (the memory), as depicted in Fig. 1: flow creation and update (Process A), and flow expiration (Process B).
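The flow-record bookkeeping described above can be illustrated with a short software model. This is only a sketch under stated assumptions: the names `FlowRecord` and `process_packet` are hypothetical, a Python dictionary stands in for the flow cache, and a FIN or RST packet always closes the flow, even one it has just created.

```python
from dataclasses import dataclass

FIN, RST = 0x01, 0x04  # TCP flag bits


@dataclass
class FlowRecord:
    first_seen: float   # timestamp of flow creation
    last_seen: float    # timestamp of the most recent packet
    packets: int = 0
    octets: int = 0
    tcp_flags: int = 0  # OR of all TCP flags seen in the flow


def process_packet(table, exported, key, length, now, flags=0):
    """Create or update the flow identified by the 5-tuple `key`."""
    rec = table.get(key)
    if rec is None:  # no active flow matches: create a new record
        rec = table[key] = FlowRecord(first_seen=now, last_seen=now)
    rec.packets += 1
    rec.octets += length
    rec.tcp_flags |= flags
    rec.last_seen = now
    if flags & (FIN | RST):  # end of TCP connection: purge and export
        exported.append((key, table.pop(key)))


table, exported = {}, []
key = ("10.0.0.1", "10.0.0.2", 1234, 80, 6)  # (src, dst, sport, dport, proto)
process_packet(table, exported, key, 60, now=0.0, flags=0x02)    # SYN
process_packet(table, exported, key, 1500, now=0.1, flags=0x10)  # ACK
process_packet(table, exported, key, 60, now=0.2, flags=0x11)    # FIN+ACK
```

After the three calls the flow has left `table` and `exported` holds one record with the accumulated packet and byte counts. Timeout-based expiration is not modelled here.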
Development Platform
The design has been implemented and tested on NetFPGA-10G, which is the second release of the NetFPGA project, an effort to develop an open-source hardware and software platform for enabling rapid prototyping of networking devices. It was developed by Stanford University together with Xilinx Research Labs to help researchers quickly build fast and complex designs, mainly in hardware. The platform is intended to provide everything necessary to get end users off the ground faster, while the open-source community allows researchers to leverage each other's work. The platform has a Virtex-5 TX240T FPGA with four SFP+ cages connected through AEL2005 PHY chips, which provide four independent 10 Gbps Ethernet ports. The board is also populated with three QDR-II and four RLDRAM memory devices that respectively provide 27 MB and 288 MB of external storage.
PROPOSED ARCHITECTURES
In this section we examine the two implementations that were developed: NF BRAM and NF QDR. The former uses internal BlockRAMs to store the active flows, while the latter uses external QDR-II memory instead. The NF BRAM implementation supports up to 16,384 concurrent flows and the NF QDR implementation supports up to 786,432. Both were designed to work without packet sampling in order to enable a precise flow analysis. The architectures are available through a public repository [10].
Architecture of NF BRAM
This implementation of the flow cache uses the BlockRAMs available on the FPGA to store the flow table. Since BlockRAMs are true dual-port memories, connecting the two concurrent processes depicted in Fig. 1 is as simple as attaching the logic of Process A to one port of the BlockRAMs and Process B to the other. The building blocks of the NF BRAM architecture shown in Fig. 2 are:
Packet Parser. This module extracts the 5-tuple from the Ethernet frames, plus the information needed to create a new flow or update an existing one: the timestamp, the TCP flags (for TCP packets) and the number of bytes within the frame.
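As an illustration of the fields the packet parser extracts, the following software sketch pulls the 5-tuple, TCP flags and frame length out of a raw Ethernet frame. It is not the hardware parser: the function name `parse_frame` is hypothetical, only untagged IPv4 frames are handled, and the timestamp is assumed to be attached separately on arrival.

```python
import socket
import struct


def parse_frame(frame: bytes):
    """Return (five_tuple, tcp_flags, frame_len), or None for non-IPv4 frames."""
    if len(frame) < 34:  # 14-byte Ethernet header + 20-byte minimum IPv4 header
        return None
    ethertype = struct.unpack_from("!H", frame, 12)[0]
    if ethertype != 0x0800:  # IPv4 only; VLAN tags are not handled in this sketch
        return None
    ihl = (frame[14] & 0x0F) * 4         # IPv4 header length in bytes
    proto = frame[23]                    # protocol field of the IPv4 header
    src = socket.inet_ntoa(frame[26:30])
    dst = socket.inet_ntoa(frame[30:34])
    sport = dport = flags = 0
    l4 = 14 + ihl                        # offset of the TCP/UDP header
    if proto in (6, 17) and len(frame) >= l4 + 4:
        sport, dport = struct.unpack_from("!HH", frame, l4)
    if proto == 6 and len(frame) >= l4 + 14:
        flags = frame[l4 + 13]           # TCP flags byte
    return (src, dst, sport, dport, proto), flags, len(frame)


# Build a minimal TCP SYN frame to exercise the parser.
eth = bytes(12) + b"\x08\x00"
ip = (bytes([0x45, 0, 0, 40, 0, 0, 0, 0, 64, 6, 0, 0])
      + socket.inet_aton("10.0.0.1") + socket.inet_aton("10.0.0.2"))
tcp = struct.pack("!HHIIBBHHH", 1234, 80, 0, 0, 0x50, 0x02, 8192, 0, 0)
parsed = parse_frame(eth + ip + tcp)
```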
Hashing Module. When a 5-tuple is received from the packet parser, this unit calculates a hash code that is used as the address where the flow record will be stored. This module is intended to be tuned when optimizing an implementation, since the probability of collision depends on the input 5-tuples, which follow a non-uniform distribution.
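A minimal sketch of the 5-tuple-to-address mapping follows. The hash function used in the hardware is not specified here, so CRC-32 is purely an assumption chosen for illustration, and `flow_address` is a hypothetical name. The table size matches the 16,384-entry NF BRAM flow table.

```python
import socket
import struct
import zlib

TABLE_SIZE = 16_384  # NF BRAM flow table capacity


def flow_address(five_tuple, table_size=TABLE_SIZE):
    """Map a 5-tuple to a flow table address (CRC-32 is an assumed hash)."""
    src, dst, sport, dport, proto = five_tuple
    # Pack the 5-tuple into its 13-byte wire representation before hashing.
    key = (socket.inet_aton(src) + socket.inet_aton(dst)
           + struct.pack("!HHB", sport, dport, proto))
    return zlib.crc32(key) % table_size


addr = flow_address(("10.0.0.1", "10.0.0.2", 1234, 80, 6))
```

Because the function is deterministic, every packet of a flow lands on the same address. Two different flows may also land on the same address, which is the collision case handled by the create/update logic.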
Create/Update Flows. This is Process A in Fig. 1. The flow table is addressed with the previously calculated hash code and its content is analyzed. If the busy flag is set, an active flow occupies that location, so the received 5-tuple is compared to the stored one. If they match, the flow is updated with the information of the received packet; if they do not match, there is a collision and the received packet is discarded. On the other hand, if the busy flag is clear, a new flow record is created in that position of the flow table. If a TCP packet with the RST or FIN flag asserted is received, the memory is polled to check whether there is an active flow to which the packet belongs. In that case the flow is updated and exported immediately.
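The decision logic of this block can be sketched in software as follows. The sketch assumes a direct-mapped table in which `None` represents a clear busy flag. The name `create_or_update` and the dictionary layout are hypothetical. A FIN or RST packet that finds an empty slot simply creates a record, a case the description above does not specify.

```python
FIN, RST = 0x01, 0x04  # TCP flag bits


def create_or_update(table, addr, key, length, now, flags=0):
    """Apply one packet to a direct-mapped flow table.

    Returns (status, exported_record); status is one of
    'created', 'updated', 'exported' or 'collision'.
    """
    entry = table[addr]
    if entry is None:  # busy flag clear: the slot is free
        table[addr] = {"key": key, "packets": 1, "bytes": length,
                       "first": now, "last": now, "flags": flags}
        return "created", None
    if entry["key"] != key:  # slot is owned by a different flow
        return "collision", None  # the packet is discarded
    entry["packets"] += 1
    entry["bytes"] += length
    entry["last"] = now
    entry["flags"] |= flags
    if flags & (FIN | RST):  # end of TCP connection: export at once
        table[addr] = None
        return "exported", entry
    return "updated", None


table = [None] * 8
flow_a = ("10.0.0.1", "10.0.0.2", 1234, 80, 6)
flow_b = ("10.0.0.3", "10.0.0.4", 5678, 443, 6)
first = create_or_update(table, 3, flow_a, 64, now=0.0)   # creates the record
clash = create_or_update(table, 3, flow_b, 64, now=0.1)   # same slot, other flow
```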
Flow Table. As stated earlier, this module is implemented with the BlockRAMs of the FPGA. It is organized to allow each flow record to be stored and read back in one memory address. The access of the two processes to the memory is completely in parallel and independent.
Timeout Conditions Monitor. This is Process B in Fig. 1, and it performs two checks on each flow record. The first verifies that the time elapsed since the last packet of the flow arrived is less than a predefined inactivity timeout. The second verifies that the time elapsed since the flow record was created is less than a predefined maximum flow duration. If either check fails, that is, if either limit has been reached, the flow record is exported and removed from the flow table.
Export Module. This module receives the flow records that were purged from the flow table and exports them out of the flow cache core. This block is the one that implements the flow exporting protocol, such as NetFlow v5, v9 or IPFIX. Flows are exported through one of the 10 Gbps Ethernet ports available in NetFPGA-10G, but this block could also be connected to the PCIe DMA engine in order to send flow records directly to the host computer.
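The two expiration checks can be sketched as a scan over the flow table. The timeout values below are placeholders rather than the ones used in the design, and the record layout (a dictionary with `first` and `last` timestamps, `None` for a free slot) is an assumption made for this illustration.

```python
INACTIVITY_TIMEOUT = 15.0    # seconds; placeholder value
MAX_FLOW_DURATION = 1800.0   # seconds; placeholder value


def scan_timeouts(table, now,
                  inactivity=INACTIVITY_TIMEOUT, max_duration=MAX_FLOW_DURATION):
    """Export and remove every record whose idle time or age hit its limit."""
    exported = []
    for addr, entry in enumerate(table):
        if entry is None:  # free slot, nothing to check
            continue
        idle = now - entry["last"]   # time since the last packet of the flow
        age = now - entry["first"]   # time since the record was created
        if idle >= inactivity or age >= max_duration:
            exported.append(entry)
            table[addr] = None       # clear the busy flag
    return exported


table = [
    {"key": "idle flow", "first": 0.0, "last": 1.0},
    None,
    {"key": "busy flow", "first": 0.0, "last": 19.0},
]
expired = scan_timeouts(table, now=20.0)
```

Here only the first record is exported: it has been idle for 19 s, while the third received a packet 1 s ago and is still well below the maximum duration.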
Architecture of NF QDR
This second architecture improves flow monitoring in three ways. First, it makes use of the external QDR-II memory to implement a much bigger flow table (786,432 entries vs. 16,384 for the NF BRAM architecture), and it also consumes far fewer BlockRAMs. Second, flow records are now 288 bits long, instead of the 241 bits of the NF BRAM design, so there are 47 additional bits to store extra information. Third, it reduces flow drops caused by collisions in the hash function. A top-level block diagram of the NF QDR implementation is shown in Fig. 3. The main difference with respect to the NF BRAM architecture lies in the memory access of the two concurrent processes. Since there is only one available port in the QDR-II external memories, we must provide a multiplexing mechanism for both processes to share the communication with the memory. We describe only the Flow Look-up, Internal Cache and Main Memory Arbiter blocks of NF QDR, since all the other modules have the same functionality as described for the NF BRAM architecture.
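A quick calculation relates the quoted record sizes and table capacities to memory footprint. It assumes records are packed back to back with no padding and that MB denotes 2^20 bytes. The variable names are chosen for this example only.

```python
QDR_FLOWS, QDR_RECORD_BITS = 786_432, 288
BRAM_FLOWS, BRAM_RECORD_BITS = 16_384, 241

extra_bits = QDR_RECORD_BITS - BRAM_RECORD_BITS            # spare bits per record
qdr_mib = QDR_FLOWS * QDR_RECORD_BITS / 8 / 2**20          # NF QDR table size
bram_kib = BRAM_FLOWS * BRAM_RECORD_BITS / 8 / 2**10       # NF BRAM table size

print(extra_bits)  # 47
print(qdr_mib)     # 27.0
print(bram_kib)    # 482.0
```

Under these assumptions the NF QDR flow table occupies exactly the 27 MB of QDR-II storage available on the board, while the NF BRAM table needs about 482 KB of on-chip memory.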
Flow Look-up. When a hash code is calculated from an extracted 5-tuple, this module first checks whether the active flow record is in the internal cache module. If the flow is in the cache, it is updated and the update is written back to the external main memory. If the flow is not found in the cache, this module performs a read operation on the main memory. The information about the received packet and its hash code is passed to the create/update flows module, and the process of creating a new flow record or updating an existing one is almost the same as in the other architecture.
Internal Cache. In order to reduce the read operations on the main memory, this cache module is used to store the most recently created flows. Bursts of packets that belong to the same flow therefore do not need to poll the memory on every packet to check whether the flow is active. If the flow to be updated upon a packet reception is in the cache, the external memory is only addressed to write the updated flow back, so that an exact copy of the information in the cache is present in the main memory. This module increases the bandwidth available for Process B, which must share the communication with the memory with Process A, as depicted in Fig. 1.
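The effect of the internal cache on memory traffic can be illustrated with a small write-through model. This is a sketch rather than the hardware design: the class name, the capacity, and the oldest-first eviction policy are assumptions, and a Python dictionary stands in for the QDR-II memory. Invalidation when Process B removes a flow is not modelled.

```python
from collections import OrderedDict


class WriteThroughCache:
    """Small cache in front of a main memory, with write-through updates."""

    def __init__(self, main_memory, capacity=4):
        self.main = main_memory      # dict addr -> record (stand-in for QDR-II)
        self.capacity = capacity
        self.lines = OrderedDict()   # recently written records
        self.reads = 0               # external read operations issued
        self.writes = 0              # external write operations issued

    def load(self, addr):
        if addr in self.lines:       # hit: no external read is needed
            return self.lines[addr]
        self.reads += 1              # miss: poll the main memory
        return self.main.get(addr)

    def store(self, addr, record):
        self.lines[addr] = record
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the oldest line
        self.main[addr] = record     # write-through keeps main memory in sync
        self.writes += 1


main = {}
cache = WriteThroughCache(main, capacity=2)
for _ in range(5):                   # burst of five packets of the same flow
    count = cache.load(7) or 0
    cache.store(7, count + 1)
```

For the five-packet burst, only the first packet causes an external read. The remaining four hit the cache, while all five updates are written through to the main memory.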
Main Memory Arbiter. The two concurrent processes must share the communication with the main memory, since no true dual-port interface is available in this case. The reception of a packet is an asynchronous event, so a predefined schedule is not possible. The dispatcher keeps enough read operations queued to maximize the memory throughput, but limits the number of pending Process B tasks to a small value so that the creation and updates of flows (Process A's tasks) obtain a good response time.
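One possible reading of this policy is sketched below: a single request queue feeds the memory port, Process A may always enqueue, and Process B is refused once it has a small number of requests outstanding. The class, its method names, and the limit of two are hypothetical. The actual dispatcher may be organized differently.

```python
from collections import deque


class Arbiter:
    """Single-port memory arbiter with a cap on pending Process B requests."""

    def __init__(self, max_b_pending=2):
        self.queue = deque()
        self.max_b_pending = max_b_pending
        self.b_pending = 0

    def request_a(self, op):
        """Process A (create/update) requests are always accepted."""
        self.queue.append(("A", op))
        return True

    def request_b(self, op):
        """Process B (timeout scan) is refused while its cap is reached."""
        if self.b_pending >= self.max_b_pending:
            return False             # Process B must retry later
        self.b_pending += 1
        self.queue.append(("B", op))
        return True

    def issue(self):
        """Hand the next queued request to the memory port, or None if idle."""
        if not self.queue:
            return None
        owner, op = self.queue.popleft()
        if owner == "B":
            self.b_pending -= 1
        return owner, op


arb = Arbiter(max_b_pending=2)
accepted = [arb.request_b(f"read {i}") for i in range(3)]
arb.request_a("update flow")
order = [arb.issue()[0] for _ in range(3)]
```

With this cap, a Process A request waits behind at most two Process B operations, which bounds its response time while still keeping the memory port busy.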
RESULTS AND VALIDATION
In this section we first report the total amount of FPGA resources used by the two proposed architectures. Then we describe the procedures followed to test the performance of the described designs, covering hardware, software and traffic details.
Hardware Resource Utilization
The designs were coded in VHDL and synthesized using Xilinx EDK/XST v13.4. Table 1 shows the main resource utilization of the FPGA for each design. The clock frequency for both architectures is 200 MHz, i.e. that of the user side of the MAC core. A careful design methodology was followed in order to meet all timing constraints.
Experimental Testbed
The verification and stress tests of the designs were carried out in our laboratories with the aid of software tools. Two general-purpose PCs with 10 Gbps Ethernet interfaces were connected to the NetFPGA-10G platform. The first PC was used as a traffic generator, running a high-performance network driver [11] capable of saturating a 10 Gbps link. The second machine captured to a file the flow records exported by the design under test. The same input traffic was processed offline with a well-known flow capturing software [12]. Both output files were compared to extract the differences. A detailed analysis of the results showed that the observed differences were due to the collisions that occurred in the hardware implementations. We used both synthetic traffic and real traces for the tests. With the synthetic traffic we tested the worst-case scenario, using a loop generator of minimum-size packets with minimum interframe gaps during a 100-second run. With the real traffic captures, we tested flow creation in a real scenario and checked the output against the software tools mentioned above.
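The worst-case load can be quantified with a short calculation based on standard Ethernet framing: a 64-byte minimum frame, an 8-byte preamble and a 12-byte minimum interframe gap. The variable names are chosen for this example, and the per-packet cycle budget is derived from the 200 MHz clock reported for both architectures.

```python
LINK_BPS = 10_000_000_000   # 10 Gbps line rate
FRAME_BYTES = 64            # minimum Ethernet frame
PREAMBLE_BYTES = 8          # preamble plus start-of-frame delimiter
IFG_BYTES = 12              # minimum interframe gap
CLOCK_HZ = 200_000_000      # user-side clock of the designs
RUN_SECONDS = 100           # duration of the synthetic stress test

bits_per_packet = (FRAME_BYTES + PREAMBLE_BYTES + IFG_BYTES) * 8
pps = LINK_BPS / bits_per_packet          # worst-case packet rate
cycles_per_packet = CLOCK_HZ / pps        # clock cycles available per packet
packets_in_run = pps * RUN_SECONDS        # packets sent during the test

print(f"{pps / 1e6:.2f} Mpps")            # 14.88 Mpps
print(f"{cycles_per_packet:.2f} cycles")  # 13.44 cycles
```

In other words, at the worst-case rate of 14.88 Mpps the flow cache has on average about 13 clock cycles to process each packet, and a 100-second run exercises the design with roughly 1.49 billion packets.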
CONCLUSIONS
We have developed an accurate and flexible flow classifier for 10 Gbps networks. The proposed design is able to cope with saturated 10 Gbps links even at the highest packet rates (14.88 Mpps for the shortest 64-byte Ethernet frames). Additionally, packet sampling is never used, which improves accuracy in the traffic analysis process, and the design supports up to 786,432 concurrent active flows. We proposed two architectures, one that uses the internal BlockRAMs of the FPGA (NF BRAM) and a second one that uses the external QDR-II memories (NF QDR). The HDL code for both architectures has been released as public open-source hardware projects. The final goal of the work presented here is to provide a high-speed and low-cost flow export platform for 10 Gbps Ethernet, which could be used by the community as a basis for the development of open flow-based monitoring tools.
