With the slowdown of Moore's law, CPU-oriented packet processing in software
will be significantly outpaced by emerging line speeds of network interface
cards (NICs). Single-core packet-processing throughput has saturated.
We consider the problem of high-speed packet processing with multiple CPU
cores. The key challenge is state: memory that multiple packets must read and
update. The prevailing method to scale throughput with multiple cores is state
sharding, in which all packets that update the same state (i.e., packets of the
same flow) are processed on the same core. However, given the heavy-tailed nature of realistic flow size
distributions, this method will be untenable in the near future, since total
throughput is severely limited by single-core performance.
This paper introduces state-compute replication, a principle to scale the
throughput of a single stateful flow across multiple cores using replication.
Our design leverages a packet history sequencer running on a NIC or
top-of-the-rack switch to enable multiple cores to update state without
explicit synchronization. Our experiments with realistic data center and
wide-area Internet traces show that state-compute replication can scale total
packet-processing throughput linearly with cores, deterministically and
independent of flow size distributions, across a range of realistic
packet-processing programs.
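
The replication principle described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the system's implementation: it assumes the sequencer sprays packets round-robin across cores and attaches the previous num_cores - 1 packets as history, and it uses a per-flow packet counter as a stand-in for a stateful program. The names `update`, `sequencer`, `run_replicated`, and `run_single_core` are hypothetical. Each core keeps a private replica of the state and replays the packets it missed before handling its own packet, so no core ever reads another core's state.

```python
from collections import deque


def update(state, flow):
    """Hypothetical stateful program: count packets per flow."""
    state[flow] = state.get(flow, 0) + 1


def sequencer(packets, num_cores):
    """Assumed sequencer: spray packets round-robin and attach to each
    packet the previous num_cores - 1 packets as its history."""
    history = deque(maxlen=num_cores - 1)
    for i, pkt in enumerate(packets):
        yield i % num_cores, list(history), pkt
        history.append(pkt)


def run_replicated(packets, num_cores):
    """Each core owns a private state replica and catches up by replaying
    the history carried in the packet (no locks, no shared memory)."""
    replicas = [{} for _ in range(num_cores)]
    results = []
    for core, history, pkt in sequencer(packets, num_cores):
        state = replicas[core]
        for missed in history:   # packets handled by other cores
            update(state, missed)
        update(state, pkt)       # the packet this core must process
        results.append(state[pkt])
    return results


def run_single_core(packets):
    """Reference: one core processes every packet in order."""
    state, results = {}, []
    for pkt in packets:
        update(state, pkt)
        results.append(state[pkt])
    return results


# One heavy flow "a" dominates the trace, yet its packets are spread
# over all cores and every core still computes the sequential result.
trace = ["a", "a", "a", "a", "a", "b", "a", "c", "a"]
print(run_replicated(trace, 4) == run_single_core(trace))
```

The simulation runs the cores one after another for simplicity. Because each replica is touched only by its own core, the same per-core loop could in principle run in parallel. The cost is redundant computation: every core applies every update, which pays off only when replaying history is cheaper than the per-packet work it avoids synchronizing.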