Why Does Flow Director Cause Packet Reordering?
Intel Ethernet Flow Director is an advanced network interface card (NIC)
technology. It provides the benefits of parallel receive processing in
multiprocessing environments and can automatically steer incoming network data
to the same core on which its application process resides. However, our
analysis and experiments show that Flow Director cannot guarantee in-order
packet delivery in multiprocessing environments. Packet reordering has
various negative impacts; for example, TCP performs poorly under severe packet
reordering. In this paper, we use a simplified model to analyze why Flow
Director can cause packet reordering. Our experiments verify our analysis.
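The following is a toy illustration, not the paper's model, of how steering a flow to a different receive queue mid-stream can reorder packets. It assumes that packets steered before the switch wait behind a backlog on the old queue, while later packets land on a less loaded queue. The function name, the fixed per-queue delays, and the one-packet-per-time-unit arrival rate are hypothetical simplifications.

```python
def delivery_order(num_packets, switch_at, old_queue_delay, new_queue_delay):
    """Return the order in which packets of one flow are delivered.

    Assumptions (hypothetical): packet `seq` arrives at time `seq`; packets
    with seq < switch_at go to the old queue, the rest to the new queue; each
    queue adds a fixed delay before its packets are processed.
    """
    completions = []
    for seq in range(num_packets):
        delay = old_queue_delay if seq < switch_at else new_queue_delay
        completions.append((seq + delay, seq))
    # Deliver in order of completion time; ties are broken by sequence number.
    return [seq for _, seq in sorted(completions)]


if __name__ == "__main__":
    # Old queue is backlogged, new queue is idle: later packets can overtake.
    print(delivery_order(6, switch_at=3, old_queue_delay=4, new_queue_delay=0))
    # Equal delays on both queues: order is preserved.
    print(delivery_order(6, switch_at=3, old_queue_delay=2, new_queue_delay=2))
```

In this sketch, reordering appears only when the old queue is slower than the new one, which is the situation a mid-flow queue change can create.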
State-Compute Replication: Parallelizing High-Speed Stateful Packet Processing
With the slowdown of Moore's law, CPU-oriented packet processing in software
will be significantly outpaced by emerging line speeds of network interface
cards (NICs). Single-core packet-processing throughput has saturated.
We consider the problem of high-speed packet processing with multiple CPU
cores. The key challenge is state, that is, memory that multiple packets must
read and update. The prevailing method to scale throughput with multiple cores
is state sharding: processing all packets that update the same state (i.e.,
the same flow) on the same core.
However, given the heavy-tailed nature of realistic flow size
distributions, this method will be untenable in the near future, since total
throughput is severely limited by single core performance.
This paper introduces state-compute replication, a principle to scale the
throughput of a single stateful flow across multiple cores using replication.
Our design leverages a packet history sequencer running on a NIC or
top-of-the-rack switch to enable multiple cores to update state without
explicit synchronization. Our experiments with realistic data center and
wide-area Internet traces show that state-compute replication can scale total
packet-processing throughput linearly with cores, deterministically and
independent of flow size distributions, across a range of realistic
packet-processing programs.
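A minimal sketch of one plausible reading of this idea, assuming a per-flow packet counter as the state. A sequencer numbers packets and attaches the flow keys of the most recent packets; packets are sprayed round-robin across cores; each core keeps a private replica of the state and replays any history entries it has not yet seen before handling its own packet. The names `sequencer`, `Core`, and `run`, and the choice of a history of `num_cores - 1` packets, are assumptions for illustration and are not taken from the paper.

```python
from collections import deque


def sequencer(packets, history_len):
    """Yield (seq, flow, history), where history holds the previous packets."""
    history = deque(maxlen=history_len)
    for seq, flow in enumerate(packets):
        yield seq, flow, list(history)
        history.append((seq, flow))


class Core:
    """One core with a private replica of the state (flow -> packet count)."""

    def __init__(self):
        self.counts = {}
        self.last_seq = -1  # highest sequence number already applied

    def _update(self, seq, flow):
        self.counts[flow] = self.counts.get(flow, 0) + 1
        self.last_seq = seq

    def process(self, seq, flow, history):
        # Replay packets handled by other cores since this core's last packet.
        for h_seq, h_flow in history:
            if h_seq > self.last_seq:
                self._update(h_seq, h_flow)
        self._update(seq, flow)
        return self.counts[flow]


def run(packets, num_cores):
    """Spray packets round-robin; no state is shared between cores."""
    cores = [Core() for _ in range(num_cores)]
    results = []
    for seq, flow, history in sequencer(packets, num_cores - 1):
        results.append(cores[seq % num_cores].process(seq, flow, history))
    return results


if __name__ == "__main__":
    print(run(["a", "a", "b", "a", "b", "a"], num_cores=3))
```

With round-robin spraying over `num_cores` cores, a core misses exactly the `num_cores - 1` packets between two of its own, which is why that history length is enough in this sketch. Every packet of a single heavy flow can then be handled by a different core while each replica still reaches the same counts as a sequential run.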
Automatic Parallelization of Software Network Functions
Software network functions (NFs) trade off flexibility and ease of deployment
for an increased challenge of performance. The traditional way to increase NF
performance is by distributing traffic to multiple CPU cores, but this poses a
significant challenge: how to parallelize an NF without breaking its semantics?
We propose Maestro, a tool that analyzes a sequential implementation of an NF
and automatically generates an enhanced parallel version that carefully
configures the NIC's Receive Side Scaling mechanism to distribute traffic
across cores, while preserving semantics. When possible, Maestro orchestrates a
shared-nothing architecture, with each core operating independently without
shared memory coordination, maximizing performance. Otherwise, Maestro
choreographs a fine-grained read-write locking mechanism that optimizes
operation for typical Internet traffic. We parallelized 8 software NFs and show
that they generally scale up linearly until bottlenecked by PCIe when using
small packets or by 100 Gbps line rate with typical Internet traffic. Maestro
further outperforms modern hardware-based transactional memory mechanisms, even
for challenging parallel-unfriendly workloads.
Comment: 21 pages, 14 figures, to be published in NSDI2
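As a rough illustration of the shared-nothing case, the sketch below maps each flow to a core by hashing its 5-tuple, so all packets of a flow reach the same core and per-flow state never needs to be shared. It uses CRC32 as a stand-in hash; real Receive Side Scaling hardware uses a different (Toeplitz) hash with a configurable key, and the function names here are hypothetical, not Maestro's.

```python
import zlib
from collections import defaultdict


def core_for(flow, num_cores):
    """Pick a core for a flow 5-tuple using a stand-in hash (CRC32)."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_cores


def shard(packets, num_cores):
    """Group packets (represented by their flow 5-tuple) by target core."""
    per_core = defaultdict(list)
    for flow in packets:
        per_core[core_for(flow, num_cores)].append(flow)
    return dict(per_core)


if __name__ == "__main__":
    flow_a = ("10.0.0.1", "10.0.0.2", 1234, 80, "tcp")
    flow_b = ("10.0.0.3", "10.0.0.2", 5678, 80, "tcp")
    for core, flows in shard([flow_a, flow_b, flow_a], num_cores=4).items():
        print(core, flows)
```

When an NF's state is keyed by something other than the hashed fields, this simple mapping no longer keeps related packets on one core, which is the case where the abstract says locking is needed instead.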
Packet Filtering Module For PFQ Packet Capturing Engine.
The evolution of commodity hardware is pushing parallelism forward as the key factor that allows software to attain hardware-class performance while retaining its advantages. On one side, commodity CPUs provide more and more cores (the next-generation Intel Xeon E 7500 CPUs will soon make 10-core processors a commodity product), with a complex cache hierarchy that makes cache-aware data placement crucial to good performance. On the other side, server NICs are adapting to these trends by increasing their own level of parallelism. While traditional 1 Gbps NICs exchanged data with the CPU through a single ring of shared memory buffers, modern 10 Gbps cards support multiple queues, so multiple cores can receive and transmit packets in parallel. In particular, incoming packets can be demultiplexed across CPUs based on a hash function (the so-called RSS technology) or on the MAC address (the VMDq technology, designed for servers hosting multiple virtual machines). The Linux kernel has recently begun to support these new technologies. Although there is a lot of network monitoring software, most of it has not yet been designed with high parallelism in mind. Therefore, a novel packet capturing engine named PFQ was designed that allows efficient capturing and in-kernel aggregation, as well as connection-aware load balancing. The engine is based on a novel lockless queue and allows parallel packet capturing, letting the user-space application arbitrarily define its degree of parallelism. Both legacy applications and natively parallel ones can therefore benefit from such a capturing engine. In addition, PFQ outperforms its competitors in terms of both captured packets and CPU consumption.
In this thesis, a new packet filtering block is designed, implemented, and added to the existing PFQ capture engine. It drops unnecessary packets before they are copied into kernel space, which considerably improves the overall performance of the engine. Because network monitors often want only a small subset of network traffic, a dramatic performance gain is realized by filtering out unwanted packets in interrupt context.
High-performance network traffic processing systems using commodity hardware
The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-36784-7_1
The Internet has opened new avenues for information accessing and sharing in a
variety of media formats. Such popularity has resulted in an increase of the
amount of resources consumed in backbone links, whose capacities have witnessed
numerous upgrades to cope with the ever-increasing demand for bandwidth.
Consequently, network traffic processing at today's data transmission rates is
a very demanding task, which has been traditionally accomplished by means of
specialized hardware tailored to specific tasks. However, such approaches lack
either flexibility or extensibility, or both. As an alternative, the research
community has pointed to the utilization of commodity hardware, which may
provide flexible and extensible cost-aware solutions, ergo entailing large
reductions of the operational and capital expenditure investments. In this
chapter, we provide a survey-like introduction to high-performance network
traffic processing using commodity hardware. We present the required background
to understand the different solutions proposed in the literature to achieve
high-speed lossless packet capture, which are reviewed and compared.
Packet Fan-Out Extension for the pcap Library
The wide availability of multi-gigabit network cards for commodity PCs requires network applications to potentially cope with high volumes of traffic. However, computation-intensive operations may not keep up with high traffic rates and need to be run in parallel over multiple processing cores. As of today, the vast majority of network applications (e.g., monitoring and IDS systems) are still based on the pcap library interface, which, unfortunately, does not provide native multi-core support, even though the current underlying capture technologies do. This paper introduces a novel version of the pcap library for the Linux operating system that enables transparent application-level parallelism. The new library supports fan-out operations for both multi-threaded and multi-process applications, by means of an extended API as well as a declarative grammar for configuration files, suitable for legacy applications. In addition, the library can transparently run on top of the standard Linux socket as well as on other accelerated active engines. Performance evaluation has been carried out on a multi-core architecture in pure capture tests and in more realistic use cases involving monitoring applications such as Tstat and Bro, with the standard Linux socket as well as the PF_RING and PFQ accelerated engines.
Separation Framework: An Enabler for Cooperative and D2D Communication for Future 5G Networks
Soaring capacity and coverage demands dictate that future cellular networks
need to soon migrate towards ultra-dense networks. However, network
densification comes with a host of challenges that include compromised energy
efficiency, complex interference management, cumbersome mobility management,
burdensome signaling overheads and higher backhaul costs. Interestingly, most
of the problems that beleaguer network densification stem from one common
feature of legacy networks, i.e., the tight coupling between the control and
data planes regardless of their degree of heterogeneity and cell density.
Consequently, in the wake of 5G, the control and data plane separation
architecture (SARC) has recently been conceived as a promising paradigm that
has the potential to address most of the aforementioned challenges. In this
article, we review various proposals that have been presented in the literature
so far to enable SARC.
More specifically, we analyze how and to what degree various SARC proposals
address the four main challenges in network densification namely: energy
efficiency, system level capacity maximization, interference management and
mobility management. We then focus on two salient features of future cellular
networks that have not yet been adopted in legacy networks at wide scale and
thus remain a hallmark of 5G, i.e., coordinated multipoint (CoMP) and
device-to-device (D2D) communications. After providing the necessary background
on CoMP and D2D, we analyze how SARC can act as a major enabler for CoMP and
D2D in the context of 5G. This article thus serves as both a tutorial and an
up-to-date survey on SARC, CoMP and D2D. Most importantly, the article provides
an extensive outlook of challenges and opportunities that lie at the crossroads
of these three mutually entangled emerging technologies.
Comment: 28 pages, 11 figures, IEEE Communications Surveys & Tutorials 201
Towards Fast, Adaptive, and Hardware-Assisted User-Space Scheduling
Modern datacenter applications are prone to high tail latencies since their
requests typically follow highly-dispersive distributions. Delivering fast
interrupts is essential to reducing tail latency. Prior work has proposed both
OS- and system-level solutions to reduce tail latencies for microsecond-scale
workloads through better scheduling. Unfortunately, existing approaches, such
as customized dataplane OSes, require significant OS changes, experience
scalability limitations, or do not reach the full performance capabilities
that the hardware offers.
The emergence of new hardware features like UINTR exposed new opportunities
to rethink the design paradigms and abstractions of traditional scheduling
systems. We propose LibPreemptible, a preemptive user-level threading library
that is flexible, lightweight, and adaptive. LibPreemptible is built with a
set of optimizations: LibUtimer for scalability, a deadline-oriented API for
flexible policies, and a time-quantum controller for adaptiveness. Compared to
the prior state-of-the-art scheduling system Shinjuku, our system achieves
significant tail latency and throughput improvements for various workloads
without modifying the kernel. We also demonstrate the flexibility of
LibPreemptible across scheduling policies for real applications experiencing
varying load levels and characteristics.
Comment: Accepted by HPCA202