System analysis
To date, efforts to exploit parallelism for network monitoring have focused heavily on parallelizing analysis across multiple cores. Vern Paxson et al. designed an architecture that exploits multi-core processors to parallelize network intrusion prevention [11], but no test results were given. In [11], a custom FPGA-based device serves as the front-end, dispatching copies of the packets to a set of analysis threads that are structured as an event-based system. By associating events with packets, the system can tell when the analysis of a given packet has completed. However, the event mechanism relies on CPU interrupts, so a high packet rate causes a large number of interrupts, leading to excessive consumption of system resources.
Improvements in multi-core platforms significantly reduce the cost of packet processing, which makes real-time traffic monitoring on 10-Gigabit networks possible. Here we estimate the packet processing capacity of a multi-core measurement platform built on server hardware with an Intel Xeon 4-core processor, 4GB of DDR3-SDRAM, and 64-bit Linux 2.6.20.
Let:
1. IC denote the number of CPU clock cycles needed to process a packet;
2. CT denote the clock rate (cycles per second) of each individual CPU core;
3. N denote the number of CPU cores;
4. M denote the packet processing capacity of the measurement platform (packets per second).
According to Amdahl's law, the speedup S obtained with N cores is
$$S = \frac{1}{F + \frac{1 - F}{N}} \qquad (1)$$
where F is the proportion of the program that cannot be parallelized. The value of M can therefore be expressed as
$$M = \frac{CT}{IC} \cdot \frac{1}{F + \frac{1 - F}{N}} \qquad (2)$$
Supposing N = 8, CT = 3.33GHz, IC = 1000, and F < 20%, Equation (2) gives M ≥ 11.1Mpps; if F decreases to 10%, M ≥ 14.8Mpps, which is close to the maximum packet forwarding rate of OC-192 (14.88Mpps). This theoretical analysis suggests that using a commodity multi-core processor to achieve 10Gbps network monitoring is feasible.
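The estimate above can be checked numerically. The following is a minimal sketch, assuming Equation (2) has the form M = (CT / IC) · 1 / (F + (1 − F) / N), i.e. the single-core packet rate scaled by the Amdahl speedup; the function names are illustrative and not taken from the paper. Under this reading, F = 20% reproduces the 11.1Mpps figure, while F = 10% evaluates to roughly 15.7Mpps, somewhat above the 14.8Mpps quoted in the text, so the authors' exact formula or rounding may differ slightly.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Amdahl's law: speedup on `cores` cores when `serial_fraction` cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def capacity_pps(clock_hz: float, cycles_per_packet: float,
                 serial_fraction: float, cores: int) -> float:
    """Assumed form of Equation (2): single-core packet rate times the Amdahl speedup."""
    single_core_rate = clock_hz / cycles_per_packet
    return single_core_rate * amdahl_speedup(serial_fraction, cores)


if __name__ == "__main__":
    # Parameters from the text: N = 8, CT = 3.33 GHz, IC = 1000 cycles per packet.
    for f in (0.20, 0.10):
        m = capacity_pps(3.33e9, 1000, f, 8)
        print(f"F = {f:.0%}: M = {m / 1e6:.1f} Mpps")
```

The sketch also makes the three levers of Equation (2) easy to explore: lowering `serial_fraction`, lowering `cycles_per_packet`, or raising `cores`.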
According to Equation (2), packets can be processed more efficiently in the following ways: 1) optimizing the parallel architecture to decrease F, 2) using faster per-packet algorithms to decrease IC, and 3) using more cores to increase N. In this paper, we focus on decreasing F. Our approach is outlined as follows:
a. Dispatching packets equally to each core to improve the load balance between them.
b. Running threads that share common data on the same CPU core, thus reducing the communication load between them.
c. Improving cache data accessibility.
Figure 1 illustrates our architecture. At the bottom of the diagram is the 10-Gigabit server adapter, which provides the interface to the network. The adapter uses the Direct Memory Access (DMA) mechanism to transfer data without subjecting the CPU to a heavy workload. When an upstream data transfer through DMA is completed, the adapter signals the CPU with an IRQ, and the Operating System (OS) then starts the standard NAPI procedure.
Overview of the architecture
It is critical to make wise use of such a multi-core platform, and programs must be specifically designed to have a parallelizable structure. Moreover, it is crucial to parallelize not only the program's execution structure but also its memory access patterns.

Task paralleling
Task parallelization must take the following into account. First, the load should be balanced among the cores. Second, the threads executing each task should stay synchronized. Third, the communication load between threads on different processor cores needs to be reduced.
To improve the load balance among the cores, we propose a packet selection algorithm that dispatches packets with similar properties to the same processor core. The details can be found in Figure 2.
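The packet selection steps of Figure 2 (hash the flow five-tuple, take C = key MOD N, and enqueue the packet on core C) can be sketched as follows. This is an illustrative sketch only: the paper does not name its hash function, so CRC32 is used here as a stand-in, the protocol number is included as the fifth tuple field by assumption, and plain in-memory deques stand in for the per-core L2-cache queues. All names are hypothetical.

```python
import ipaddress
import zlib
from collections import deque
from typing import Deque, List, NamedTuple, Tuple


class FiveTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: int  # assumed fifth field (e.g. 6 = TCP, 17 = UDP)


def flow_hash(flow: FiveTuple) -> int:
    """Hash key over the five-tuple; CRC32 is an assumption, not the paper's choice."""
    packed = (
        ipaddress.ip_address(flow.src_ip).packed
        + flow.src_port.to_bytes(2, "big")
        + ipaddress.ip_address(flow.dst_ip).packed
        + flow.dst_port.to_bytes(2, "big")
        + bytes([flow.protocol])
    )
    return zlib.crc32(packed)


def dispatch(flow: FiveTuple, payload: bytes,
             queues: List[Deque[Tuple[int, bytes]]]) -> int:
    """Select core C = key MOD N and store (hash key, packet data) in that core's queue."""
    key = flow_hash(flow)
    core = key % len(queues)
    queues[core].append((key, payload))
    return core


if __name__ == "__main__":
    queues = [deque() for _ in range(8)]  # one queue per core, N = 8
    flow = FiveTuple("10.0.0.1", 40000, "192.168.1.20", 80, 6)
    print("packet dispatched to core", dispatch(flow, b"payload", queues))
```

Because the core index depends only on the five-tuple, every packet of a flow lands on the same core, which keeps per-flow state local to that core.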
Data partition
To reduce the communication load and data sharing between threads, a data partition method is proposed that separates the packet data equally into several partitions, which are suitable for high-speed cache access by each thread. A mutex variable is incremented by 1 after each thread finishes processing its data partition, and it can be used to determine whether processing of the packet has been completed. Each partition needs a 20-byte header (an 8-byte pointer that links to the record, an 8-byte pointer that links to function (x), and an 8-byte type) to store the information used to link it to a processing thread. More partitions therefore require more extra storage space.
Figure 2. Packet selection algorithm:
1. Extract the five-tuple 'flow' features from the packet;
2. Calculate a hash key on the five-tuple, which includes the source IP and port and the destination IP and port;
3. Generate a variable C from the hash key according to the equation C = key MOD N;
4. Select one of the processor cores using C;
5. Dispatch the packet to core C, and store the packet data and hash key in its L2-cache queue.
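The data partition idea can be illustrated with a small sketch: the packet data is split into roughly equal partitions, each partition carries a header linking it to a record, a processing function, and a type, and a lock-protected counter (playing the role of the shared mutex variable) is incremented as each thread finishes. Everything here is an assumption made for illustration, including the names, the use of Python threads, and the completion test (counter has reached the number of threads).

```python
import threading
from dataclasses import dataclass
from typing import Callable, Dict, List


def split_equally(data: bytes, parts: int) -> List[bytes]:
    """Split data into `parts` chunks of (nearly) equal size; the last may be shorter."""
    size = -(-len(data) // parts)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(parts)]


@dataclass
class Partition:
    record: Dict                              # stands in for the pointer to the record
    handler: Callable[[bytes, Dict], None]    # stands in for the pointer to function (x)
    kind: int                                 # stands in for the type field
    data: bytes


class CompletionCounter:
    """Shared counter, incremented by 1 when a thread finishes its partition."""

    def __init__(self, expected: int) -> None:
        self._lock = threading.Lock()
        self._done = 0
        self._expected = expected

    def mark_done(self) -> None:
        with self._lock:
            self._done += 1

    def finished(self) -> bool:
        with self._lock:
            return self._done >= self._expected


def process_packet(data: bytes, handlers: List[Callable[[bytes, Dict], None]],
                   record: Dict) -> bool:
    """Process one partition per thread and report whether all partitions completed."""
    counter = CompletionCounter(len(handlers))
    partitions = [Partition(record, h, i, chunk)
                  for i, (h, chunk) in enumerate(zip(handlers, split_equally(data, len(handlers))))]

    def worker(p: Partition) -> None:
        p.handler(p.data, p.record)
        counter.mark_done()

    threads = [threading.Thread(target=worker, args=(p,)) for p in partitions]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.finished()
```

In this sketch the per-partition header is what grows with the number of partitions, which mirrors the storage overhead noted above.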
Cache
The multi-core platform provides an L1 cache, an L2 cache that is shared between cores, and a large SDRAM. Taking advantage of the shared L2 cache, the system stores the partitioned data in the multi-level cache; the details are presented in Figure 1.
When the OS starts processing packet data, it first transfers each packet to the corresponding queue in the L2 cache using the packet selection algorithm. Each CPU core has its own queue, structured as a doubly linked list. After the data partitioning algorithm has been applied, the node that contains the packet data is inserted at the tail of the queue; nodes are removed from the head of the queue. If the queue is empty, tasks are blocked; if the queue is full, new packet data is dropped (this happens only if the packet rate exceeds the capability of the hardware). The node data is then loaded into the L1 cache for processing. The mutex variable shared between the parallel threads is used to decide when processing has finished: once it is greater than the number of threads, processing is finished and the node data is deleted. Finally, the processing results are written into the record queue, which is also structured as a doubly linked list.
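The queue behaviour described above (insert at the tail, remove from the head, block consumers when empty, drop new packets when full) can be sketched as a small bounded queue. This is an illustration under stated assumptions rather than the authors' kernel-level implementation: `collections.deque` stands in for the doubly linked list, a condition variable stands in for task blocking, and the class and method names are hypothetical.

```python
import threading
from collections import deque
from typing import Any, Optional


class CoreQueue:
    """Bounded per-core FIFO: drops on overflow, blocks consumers while empty."""

    def __init__(self, capacity: int) -> None:
        self._nodes: deque = deque()   # deque is backed by a doubly linked structure
        self._capacity = capacity
        self._cond = threading.Condition()
        self.dropped = 0               # number of packets dropped because the queue was full

    def enqueue(self, node: Any) -> bool:
        """Insert at the tail; return False (and count a drop) if the queue is full."""
        with self._cond:
            if len(self._nodes) >= self._capacity:
                self.dropped += 1
                return False
            self._nodes.append(node)
            self._cond.notify()
            return True

    def dequeue(self, timeout: Optional[float] = None) -> Optional[Any]:
        """Remove from the head, blocking while empty; return None if the wait times out."""
        with self._cond:
            if not self._cond.wait_for(lambda: len(self._nodes) > 0, timeout):
                return None
            return self._nodes.popleft()
```

A worker thread pinned to a core would call `dequeue()` in a loop, while the dispatch path calls `enqueue()` and inspects `dropped` to detect overload.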
A hash buffer is proposed for quickly locating records in the queue. It occupies a large amount of physical memory (allocated with the function alloc_bootmem_low_pages()). Each pointer stored in the hash buffer points to a record, which can be found through the hash key.
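As a rough illustration of the lookup path, the hash buffer can be thought of as a preallocated table of slots indexed by the hash key, where each slot holds references to records. The sketch below is an assumption-laden user-space analogy: the slot count, the use of chaining for collisions, and all names are hypothetical, and the paper does not say how collisions are handled.

```python
from typing import Any, List, Optional, Tuple


class HashBuffer:
    """Preallocated table mapping a flow hash key to a reference to its record."""

    def __init__(self, slots: int) -> None:
        # All slots are allocated up front, mirroring a buffer reserved at boot time.
        self._slots: List[List[Tuple[int, Any]]] = [[] for _ in range(slots)]

    def put(self, key: int, record: Any) -> None:
        """Store (or replace) the record reference for this hash key."""
        bucket = self._slots[key % len(self._slots)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, record)
                return
        bucket.append((key, record))

    def get(self, key: int) -> Optional[Any]:
        """Return the record for this hash key, or None if it is not present."""
        for existing_key, record in self._slots[key % len(self._slots)]:
            if existing_key == key:
                return record
        return None
```

Since the packet selection step already computes a hash key per packet, the same key can be reused here, so locating a record costs one index computation plus a short bucket scan.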
Experiment results
The network monitoring system based on the parallel architecture has been implemented and deployed; it provides DPI packet identification and anomaly detection. The system was run on a server with two Intel Xeon 5504 2.0GHz CPUs (8 cores in total), 4GB of PC-1333 RAM, 64-bit Linux 2.6.27, and an Intel-EXPX9501-10G adapter; the code was compiled with GCC v4.1.2 at the -O3 optimization level. In Figure 3, the Huawei-S9312 provides 72 1-Gigabit ports and 4 10-Gigabit ports; each 1-Gigabit port connects a number of PC hosts; one 10-Gigabit port uplinks the router to the ISP aggregation switch, and its traffic in both directions is mirrored to another 10-Gigabit port, from which our system collects data.
Influence of threads per core
In this section we analyze the influence of the number of threads per core on the packet processing cost and on the memory cost in the L2 cache. Since the execution time of this processing is rather small, we counted clock ticks with the RDTSC assembly instruction available on Intel processors. As shown in Figure 4 (I), the number of threads per core has a great impact on the packet processing cost, whereas the number of CPUs has little effect on it. As the number of threads increases, the packet processing cost decreases: from 1 to 2 threads the decrease is 20%, and from 2 to 4 it reaches 43%, but from 4 to 8 it is only 16%. However, increasing the number of threads per core consumes more extra memory in the L2 cache. In Figure 4 (II), 4 threads need only 18% extra memory, but with 8 threads the extra memory cost reaches 31%. We conclude that four threads per core is the most appropriate choice in this scenario.
Load balance between multi-cores
The load balance among the cores (memory usage in the L2 cache and CPU usage), which is affected by the packet selection algorithm mentioned in Section 3.1, is measured in this section.
Performance evaluation
This section reports the experimental analysis of performance, mainly in terms of CPU and memory usage. To evaluate the effectiveness of our parallel architecture, measurements were carried out with two sets of real traces, which are given in Table I. In Set2, the system fully applies our parallel architecture, while in Set1 the system does not apply task parallelization. The distributions of CPU usage and packet rate for the two sets of real traces are shown in Figure 6 (I-a) and (II-a). CPU usage is reduced greatly when the parallel architecture is applied: the CPU usage of Set1 is more than twice that of Set2. Figure 6 (I-b) shows that, without the parallel architecture, the CPU usage fluctuates and is unstable. As shown in Figure 6 (I-c) and (II-c), the memory usage of Set2 is 3% higher than that of Set1, but it remains stable.
As shown in Figure 6 (II-b), with our parallel architecture applied, the curve of the packet processing rate is completely covered by the curve of CPU usage. We expect that the CPU usage would reach 100% when the packet rate reaches 14.4Mpps, nearly the maximum packet forwarding rate of OC-192 (14.88Mpps). At present, the average packet length on the Internet is about 450 bytes. When the processing rate of our system reaches its maximum (2.8Mpps), the corresponding link bandwidth can be calculated as 2.8Mpps × 450 bytes × 8 bits/byte ≈ 10.08Gbps, and the CPU usage is then only 25%. Although CPU usage cannot be fully tested under current conditions, it can be inferred that the proposed parallel architecture can theoretically support real-time 10Gbps network monitoring.
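The rate figures in this paragraph follow from simple unit conversions, sketched below. The first function converts a packet rate and mean packet size into link bandwidth (packets/s × bytes × 8 bits); the second derives the 14.88Mpps figure under the assumption that it refers to 10 Gigabit Ethernet carrying minimum-size 64-byte frames plus 20 bytes of preamble and inter-frame gap per frame. Function names and that framing assumption are not taken from the paper.

```python
def link_bandwidth_gbps(packet_rate_pps: float, mean_packet_bytes: float) -> float:
    """Bandwidth in Gbit/s carried by packet_rate_pps packets of mean_packet_bytes each."""
    return packet_rate_pps * mean_packet_bytes * 8 / 1e9


def max_packet_rate_pps(link_bps: float, frame_bytes: int = 64,
                        overhead_bytes: int = 20) -> float:
    """Worst-case packet rate on a link, assuming Ethernet framing overhead per frame."""
    return link_bps / ((frame_bytes + overhead_bytes) * 8)


if __name__ == "__main__":
    # 2.8 Mpps at a 450-byte mean packet size, as measured in the text.
    print(f"{link_bandwidth_gbps(2.8e6, 450):.2f} Gbps")
    # Minimum-size frames on a 10 Gbit/s link.
    print(f"{max_packet_rate_pps(10e9) / 1e6:.2f} Mpps")
```

With a 450-byte mean packet size, a fully loaded 10Gbps link therefore carries far fewer packets per second than the worst-case minimum-frame rate, which is why the measured CPU usage stays low.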
Conclusion
In this paper, we present an architecture for a traffic monitoring system that fully exploits the parallel power of general-purpose multi-core processors. The contributions of our work are as follows: 1) an architecture using a commodity multi-core processor is proposed for real-time monitoring on 10-Gigabit networks; 2) a packet selection algorithm is proposed, which dispatches packets with similar properties to the same processor core to improve the load balance among cores; 3) a data partition method is presented, which separates each packet equally into several partitions to reduce the communication and data-sharing overhead between threads.
We implemented a prototype traffic monitoring system based on our architecture using a standard server PC and evaluated its performance. The system was tested on a campus network, and the results show that it can meet the monitoring needs of a 10Gbps network.
