Abstract-Commodity hardware technology has advanced in great leaps in terms of CPU, memory, and I/O bus speeds.
I. INTRODUCTION
Software routers are attractive platforms for flexible packet processing. While the early routers were built on general-purpose computers, they could not compete with carrier-grade routers running at tens of Gbps or higher, and gave way to specialized hardware in the late 1990s. With recent advancements in PC hardware, such as multi-core CPUs, high-bandwidth network cards, and fast CPU-to-memory interconnects and system buses, software routers are coming back with a competitive cost-performance ratio. For example, RouteBricks, an experimental software router platform, reports 8.33 Gbps, or 6.35 Gbps excluding Ethernet overheads, for IPv4 routing of 64B packets on a single PC [3]. In this paper we raise the following question: How far can we push the performance of a single-box software router with technologies from today and the predictable future? We map out expected hurdles and project the speed-ups needed to reach 100 Gbps on a single x86 machine. Can we reduce per-packet processing overhead? We find the solutions in packet processing software optimizations.
II. OPPORTUNITIES AND CHALLENGES

Recent architectural improvements from
RouteBricks uses NIC-driven batching for performance improvement. We propose the following further improvements: (i) remove dynamic per-packet buffer allocation and use static buffers instead; (ii) prefetch descriptors and packet data to mitigate compulsory cache misses; (iii) minimize cache bouncing and eliminate false sharing [6] between CPU cores. By incorporating these improvements, we achieve about a factor of six reduction in per-packet processing overhead and reduce the cost to under 200 CPU cycles per packet [4]. The aggregate CPU cycle rate required for 100 Gbps forwarding then comes down to about 30 GHz, which is achievable with today's CPUs.
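The 30 GHz figure can be checked with a short back-of-envelope calculation. The sketch below is illustrative only: it assumes minimum-size 64B Ethernet frames and the standard 20 bytes of per-frame wire overhead (preamble, start-of-frame delimiter, and inter-frame gap), neither of which is stated in the paragraph above, and all names are hypothetical.

```python
# Back-of-envelope CPU cycle budget for 100 Gbps forwarding.
# Assumptions (not stated in the text above): 64 B minimum-size frames and
# 20 B of wire overhead per frame (7 B preamble + 1 B SFD + 12 B inter-frame gap).

LINK_RATE_BPS = 100e9        # target forwarding rate
FRAME_BYTES = 64             # assumed packet size
WIRE_OVERHEAD_BYTES = 20     # assumed per-frame Ethernet wire overhead
CYCLES_PER_PACKET = 200      # per-packet cost quoted in the text


def packets_per_second(link_rate_bps, frame_bytes,
                       overhead_bytes=WIRE_OVERHEAD_BYTES):
    """Packet rate when the link is saturated with equal-size frames."""
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    return link_rate_bps / bits_on_wire


def required_cycles_per_second(link_rate_bps, frame_bytes, cycles_per_packet):
    """Aggregate CPU cycles per second needed to keep up with the link."""
    return packets_per_second(link_rate_bps, frame_bytes) * cycles_per_packet


pps = packets_per_second(LINK_RATE_BPS, FRAME_BYTES)
ghz = required_cycles_per_second(LINK_RATE_BPS, FRAME_BYTES,
                                 CYCLES_PER_PACKET) / 1e9
print(f"{pps / 1e6:.1f} Mpps, {ghz:.1f} GHz aggregate")
```

Under these assumptions the link carries roughly 149 million packets per second, so 200 cycles per packet amounts to just under 30 GHz summed over all cores.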
While packet forwarding is the core functionality of routers, it is only one of many tasks that a typical router performs.
I/O Capacity Measurement: We measure I/O capacity to see whether the system achieves the theoretical limits. At the time of this experiment we have access to only eight NICs, half of which are used as packet generators, so we limit our experiment to four dual-port NICs. We use two systems for evaluation.
One is a server machine with dual CPU sockets and dual IOHs, and the other is a desktop machine with one CPU socket and a single IOH. The desktop machine has three PCIe slots: two are occupied by NICs and one by a graphics card.
To gauge I/O capacity and identify its bottleneck, we consider three configurations in Figure 2: (i) three NICs are connected to one IOH in the server system and one NUMA node is used for packet processing, (ii) two NICs are connected to each IOH in the server system (four NICs in total), and [...]
In Figure 4, we also show the results when packets cross the NUMA nodes, but we see little performance degradation compared to the non-crossing cases. This implies that the QPI link is not the limiting factor at 40 Gbps, but the question remains as to where the bottleneck is.
To find the cause of the throughput difference between the RX and TX sides, we conduct the same experiment with the desktop system. Figure 5 shows the results. Interestingly, both RX and TX reach the full throughput of 40 Gbps with two NICs. This leads us to the conclusion that the RX performance degradation in the server system is due to the dual IOHs rather than the dual CPU sockets. Searching online, we find that the receive I/O throughput degradation with dual IOHs is also known to the GPGPU programming community, and that a single IOH with dual sockets does not have the problem [2]. Forwarding performance is around 30 Gbps, lower than the RX and TX throughput. Since QPI and PCIe are full-duplex links, I/O should not be the problem. We find that the forwarding performance in the desktop scenario is limited by a memory bottleneck. We explain further details in the following section.
C. Memory Bandwidth
Forwarding a packet involves several memory accesses. To forward 100 Gbps traffic, the minimum memory bandwidth for packet data is 400 Gbps (100 Gbps for transfer between NICs and memory, another 100 Gbps for transfer between memory and CPUs, and doubled to cover both the RX and TX directions). Bookkeeping operations on packet descriptors add 16 bytes of memory read/write access for each packet, putting further pressure on the memory buses, more so for smaller packets.
In Figure 5, we see that the forwarding throughput is lower than that of RX and TX due to insufficient memory bandwidth.
We find that (i) CPU usage for forwarding is 100% regardless of packet size, with most CPU cycles wasted on memory load/store stalls, and (ii) overclocking the memory to gain more memory bandwidth improves the forwarding performance to close to 40 Gbps.
For both our server and desktop configurations, we use triple-channel DDR3 at 1,333 MHz, giving a theoretical peak bandwidth of 32.0 GB/s for each CPU and an empirical bandwidth of 17.9 GB/s according to our experiments. Unfortunately, assuming two nodes in the system, we need an effective memory bandwidth of 25 GB/s for each node to forward 100 Gbps traffic.
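The bandwidth figures in this section can be reproduced with a small calculation. The following is a rough sketch under stated assumptions: it counts packet data only (descriptor bookkeeping is ignored), takes the standard 64-bit (8-byte) width of a DDR3 channel, and uses hypothetical function names.

```python
# Rough memory bandwidth budget for 100 Gbps forwarding on a NUMA system.
# Assumptions: packet data only (no descriptor traffic), 8-byte-wide DDR3
# channels, and load spread evenly across NUMA nodes.

def required_memory_bandwidth_gbytes(forwarding_gbps, nodes):
    """Per-node memory bandwidth (GB/s) needed for packet data."""
    # NIC<->memory and memory<->CPU transfers, each for RX and TX:
    # four memory transfers per forwarded byte.
    total_gbps = forwarding_gbps * 2 * 2
    return total_gbps / 8 / nodes


def ddr3_peak_gbytes(transfer_rate_mtps, channels, bus_bytes=8):
    """Theoretical peak bandwidth (GB/s) of a multi-channel DDR3 setup."""
    return transfer_rate_mtps * 1e6 * bus_bytes * channels / 1e9


required = required_memory_bandwidth_gbytes(forwarding_gbps=100, nodes=2)
peak = ddr3_peak_gbytes(transfer_rate_mtps=1333, channels=3)
empirical = 17.9  # GB/s, measured value reported in the text

print(f"required per node: {required:.1f} GB/s")
print(f"theoretical peak:  {peak:.1f} GB/s")
print(f"shortfall vs. measured: {required - empirical:.1f} GB/s")
```

The requirement of 25 GB/s per node fits under the theoretical peak but exceeds the measured 17.9 GB/s, which is why distributing the load over more nodes is attractive.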
One way to boost the memory bandwidth in NUMA systems is to add more nodes and distribute the load across the physical memory of different nodes. In this case, NUMA-aware data placement becomes particularly important, because remote memory access is expensive in NUMA systems in terms of latency and may overload the interconnects between nodes. High-performance software routers on NUMA systems should partition work across nodes carefully so that communication between nodes is minimized.
III. DISCUSSION AND FUTURE WORK
In this paper we have reviewed the feasibility of a 100 Gbps router with today's technology. We find two major bottlenecks in the current PC architecture: CPU cycles and I/O bandwidth.
For the former, we propose reducing per-packet processing overhead with optimization techniques and amplifying the computing cycles with FPGAs or GPUs. For the latter, we believe that improvements in IOH chipsets and multi-IOH configurations, together with more memory bandwidth from four or more CPU sockets, could alleviate the bottleneck. A 100 Gbps software router will open up great opportunities for researchers to experiment with new ideas, and we believe it will be a reality in the near future.
