
    Why Does Flow Director Cause Packet Reordering?

    Intel Ethernet Flow Director is an advanced network interface card (NIC) technology. It provides the benefits of parallel receive processing in multiprocessing environments and can automatically steer incoming network data to the same core on which its application process resides. However, our analysis and experiments show that Flow Director cannot guarantee in-order packet delivery in multiprocessing environments. Packet reordering has various negative effects; for example, TCP performs poorly under severe packet reordering. In this paper, we use a simplified model to analyze why Flow Director can cause packet reordering. Our experiments verify our analysis.
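
    A minimal sketch of the mechanism analyzed above (the queue layout, packet counts, and drain order are illustrative assumptions, not taken from the paper): when a flow's steering entry is rewritten mid-stream, packets still buffered behind the old receive queue can be delivered after newer packets that took the new queue, producing reordering.

        /* Illustrative only: re-steering a flow mid-stream can reorder packets. */
        #include <stdio.h>

        #define QLEN 16
        struct queue { int pkts[QLEN]; int n; };
        static void enqueue(struct queue *q, int seq) { q->pkts[q->n++] = seq; }

        int main(void) {
            struct queue rxq[2] = { {{0}, 0}, {{0}, 0} };
            int steer = 0;                     /* flow steered to RX queue 0   */

            /* Packets 1..3 arrive while the steering entry points at queue 0,
             * whose core is busy, so they wait in that queue.                 */
            for (int seq = 1; seq <= 3; seq++)
                enqueue(&rxq[steer], seq);

            /* The application migrates; an outgoing packet rewrites the
             * Flow Director entry, so packets 4..6 land on queue 1.           */
            steer = 1;
            for (int seq = 4; seq <= 6; seq++)
                enqueue(&rxq[steer], seq);

            /* Queue 1's core is idle and drains first; queue 0 drains later.  */
            printf("delivery order:");
            for (int i = 0; i < rxq[1].n; i++) printf(" %d", rxq[1].pkts[i]);
            for (int i = 0; i < rxq[0].n; i++) printf(" %d", rxq[0].pkts[i]);
            printf("\n");                      /* 4 5 6 1 2 3: out of order    */
            return 0;
        }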

    Improving network connection locality on multicore systems

    Incoming and outgoing processing for a given TCP connection often execute on different cores: an incoming packet is typically processed on the core that receives the interrupt, while outgoing data processing occurs on the core running the relevant user code. As a result, accesses to read/write connection state (such as TCP control blocks) often involve cache invalidations and data movement between cores' caches. These can take hundreds of processor cycles, enough to significantly reduce performance. We present a new design, called Affinity-Accept, that causes all processing for a given TCP connection to occur on the same core. Affinity-Accept arranges for the network interface to determine the core on which application processing for each new connection occurs, in a lightweight way; it adjusts the card's choices only in response to imbalances in CPU scheduling. Measurements show that for the Apache web server serving static files on a 48-core AMD system, Affinity-Accept reduces time spent in the TCP stack by 30% and improves overall throughput by 24%.
    Funding: National Science Foundation (U.S.) grants CNS-0834415 and CNS-0915164; Quanta Computer.
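
    The paper's mechanism lives inside the kernel's accept path; as a rough user-space analogue of the same goal (keeping all processing for a connection on one core), the sketch below gives each pinned worker thread its own listening socket via SO_REUSEPORT. Note that SO_REUSEPORT spreads connections across listeners by flow hash rather than by the receiving core, so this only approximates Affinity-Accept; the port number and worker count are arbitrary example values.

        /* User-space approximation only -- not the paper's in-kernel design.  */
        #define _GNU_SOURCE
        #include <netinet/in.h>
        #include <pthread.h>
        #include <sched.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <unistd.h>

        static void *worker(void *arg) {
            int core = (int)(long)arg;

            /* Pin the thread so accepted connections are processed locally.   */
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(core, &set);
            pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

            /* Per-core listening socket sharing one port via SO_REUSEPORT.    */
            int ls = socket(AF_INET, SOCK_STREAM, 0);
            int one = 1;
            setsockopt(ls, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

            struct sockaddr_in addr;
            memset(&addr, 0, sizeof(addr));
            addr.sin_family = AF_INET;
            addr.sin_addr.s_addr = htonl(INADDR_ANY);
            addr.sin_port = htons(8080);       /* example port                 */
            bind(ls, (struct sockaddr *)&addr, sizeof(addr));
            listen(ls, 128);

            for (;;) {
                int c = accept(ls, NULL, NULL);
                if (c < 0) break;
                /* ... handle the connection entirely on this core ...         */
                close(c);
            }
            close(ls);
            return NULL;
        }

        int main(void) {
            enum { NWORKERS = 4 };             /* example worker/core count    */
            pthread_t t[NWORKERS];
            for (long i = 0; i < NWORKERS; i++)
                pthread_create(&t[i], NULL, worker, (void *)i);
            for (int i = 0; i < NWORKERS; i++)
                pthread_join(t[i], NULL);
            return 0;
        }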

    NIC-assisted cache-efficient receive stack for message passing over Ethernet

    High-speed networking in clusters usually relies on advanced hardware features in the NICs, such as zero-copy capability. Open-MX is a high-performance message passing stack tailored for regular Ethernet hardware without such capabilities. We present the addition of multiqueue support in the Open-MX receive stack so that all incoming packets for the same process are handled on the same core. We then introduce the idea of binding the target end process near its dedicated receive queue. This model leads to a more cache-efficient receive stack for Open-MX. It also shows that very simple and stateless hardware features may have a significant impact on message passing performance over Ethernet. The implementation of this model in firmware reveals that it may not be as efficient as some manually tuned micro-benchmarks. However, our multiqueue receive stack generally performs better than the original single-queue stack, especially for large communication patterns where multiple processes are involved and manual binding is difficult.
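
    A minimal sketch of the binding idea described above (the hash, queue count, and queue-to-core mapping are assumptions for illustration, not Open-MX's actual firmware logic): a stateless hash of the destination endpoint selects the receive queue, and the target process binds itself to the core that services that queue.

        /* Illustrative binding sketch; details are assumed, not Open-MX code.  */
        #define _GNU_SOURCE
        #include <sched.h>
        #include <stdint.h>
        #include <stdio.h>

        #define NUM_RX_QUEUES 8   /* assumed number of NIC receive queues */

        /* Stateless per-packet hash the firmware is imagined to apply.    */
        static unsigned rx_queue_for_endpoint(uint32_t endpoint_id) {
            return endpoint_id % NUM_RX_QUEUES;
        }

        /* Assumed 1:1 mapping from receive queue to its interrupt core.   */
        static int core_for_rx_queue(unsigned queue) {
            return (int)queue;
        }

        int main(void) {
            uint32_t my_endpoint = 5;          /* example endpoint identifier  */
            unsigned q = rx_queue_for_endpoint(my_endpoint);
            int core = core_for_rx_queue(q);

            /* Bind this process next to its dedicated receive queue.          */
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(core, &set);
            if (sched_setaffinity(0, sizeof(set), &set) != 0)
                perror("sched_setaffinity");

            printf("endpoint %u -> rx queue %u -> core %d\n", my_endpoint, q, core);
            return 0;
        }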

    Design of a scalable network interface to support enhanced TCP and UDP processing for high speed networks

    Communication networks have advanced rapidly in providing additional services, with improvements made to their bandwidth and the integration of advanced technology. As network speeds exceed 10 Gbps, the time frame available for completing TCP and UDP packet processing has become extremely short. The design and implementation of high-performance Network Interfaces (NIs) that can offload protocol functions for current and next-generation networks is challenging. In this thesis, two software approaches are presented to enhance TCP and UDP protocol processing in the network interface. First, a novel software Large Receive Offload (LRO) approach for enhancing the receiving side is proposed. The LRO works by aggregating incoming TCP and UDP packets into larger packets inside the NI's buffer, and the receiving-side software has been improved to support out-of-order packets. The second proposed software solution applies to Large Send Offload (LSO). The proposed LSO function segments TCP and UDP messages that are larger than the Maximum Transmission Unit into Maximum Segment Size packets, generating a new packet header for each outgoing packet. A scalable programmable NI based on a 32-bit RISC core is presented that can support 100 Gbps network speeds. Processing at the NI has been accelerated to prevent hazards (such as data hazards and control hazards) during execution of the LRO and LSO functions. An R2000/3000 RISC was used to test the LRO and LSO functions and to determine the most suitable instruction set. Following this, the NI was implemented in VHDL with three pipelined RISC cores, a simple DMA controller and a Content Addressable Memory. An evaluation of the RISC clock rate required to process TCP and UDP streams at 100 Gbps was conducted; it was determined that a RISC core running at 752 MHz with a DMA clock of 3753 MHz can process packets of 512 bytes or larger fast enough to support 100 Gbps network speeds.
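
    As an illustration of the LSO step described above, the sketch below splits one oversized message into MSS-sized segments and generates a fresh header for each outgoing packet; the header layout and MSS value are simplified assumptions, not the thesis's NI design.

        /* Simplified LSO segmentation sketch; header fields are assumptions.   */
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        #define MSS 1460               /* assumed maximum segment size (bytes)  */

        struct seg_header {            /* minimal stand-in for a TCP header     */
            uint32_t seq;              /* sequence number of first payload byte */
            uint16_t len;              /* payload bytes in this segment         */
            uint8_t  last;             /* set on the final segment              */
        };

        /* Split one large message into MSS-sized segments, emitting a new
         * header per outgoing packet, as a large-send-offload function would.  */
        static int segment_message(uint32_t start_seq, const uint8_t *msg, size_t len) {
            int nsegs = 0;
            for (size_t off = 0; off < len; off += MSS, nsegs++) {
                size_t chunk = (len - off < MSS) ? (len - off) : MSS;
                struct seg_header h = {
                    .seq  = start_seq + (uint32_t)off,
                    .len  = (uint16_t)chunk,
                    .last = (off + chunk == len),
                };
                /* A real NI would DMA the header plus msg[off .. off+chunk).   */
                printf("segment %d: seq=%u len=%u last=%u\n", nsegs, h.seq, h.len, h.last);
            }
            (void)msg;
            return nsegs;
        }

        int main(void) {
            static uint8_t msg[9000];          /* jumbo-sized example message   */
            memset(msg, 0xab, sizeof(msg));
            segment_message(1000, msg, sizeof(msg));
            return 0;
        }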

    An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems (originally appeared at USENIX 2006)

    As processors move toward chip multiprocessor designs, operating system network stacks must be parallelized in order to keep pace with improvements in network bandwidth. There are two competing strategies for stack parallelization. Message-parallel network stacks use concurrent threads to carry out network operations on independent messages (usually packets), whereas connection-parallel stacks map operations to groups of connections and permit concurrent processing on independent connection groups. Connection-parallel stacks can use either locks or threads to serialize access to connection groups. This paper evaluates these parallel stack organizations using a modern operating system and chip multiprocessor hardware. Compared to uniprocessor kernels, all parallel stack organizations incur additional locking overhead, cache inefficiencies, and scheduling overhead. However, the organizations balance these limitations differently, leading to variations in peak performance and connection scalability. Lock-serialized connection-parallel organizations reduce the locking overhead of message-parallel organizations by using many connection groups and eliminate the expensive thread handoff mechanism of thread-serialized connection-parallel organizations. The resultant organization outperforms the others, delivering 5.4 Gb/s of TCP throughput for most connection loads and providing a 126% throughput improvement versus a uniprocessor for the heaviest connection loads.
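
    A minimal sketch of the lock-serialized connection-parallel organization described above (the group count and hash are illustrative assumptions): connections hash to groups, each group has its own lock, and any thread can process a packet after acquiring only that group's lock, avoiding both a stack-wide lock and a thread handoff.

        /* Illustrative sketch of per-connection-group locking.                 */
        #include <pthread.h>
        #include <stdint.h>
        #include <stdio.h>

        #define NGROUPS 64   /* assumed number of connection groups */

        static pthread_mutex_t group_lock[NGROUPS];

        struct conn_tuple {
            uint32_t saddr, daddr;
            uint16_t sport, dport;
        };

        /* Map a connection to its group; any cheap 4-tuple hash will do here.  */
        static unsigned conn_group(const struct conn_tuple *t) {
            uint32_t h = t->saddr ^ t->daddr ^ ((uint32_t)t->sport << 16 | t->dport);
            return h % NGROUPS;
        }

        /* Serialize only against packets whose connections share the group,
         * instead of taking a single stack-wide lock.                          */
        static void process_packet(const struct conn_tuple *t) {
            unsigned g = conn_group(t);
            pthread_mutex_lock(&group_lock[g]);
            /* ... TCP/IP processing against this connection's state ...        */
            pthread_mutex_unlock(&group_lock[g]);
        }

        int main(void) {
            for (int i = 0; i < NGROUPS; i++)
                pthread_mutex_init(&group_lock[i], NULL);

            struct conn_tuple t = { 0x0a000001, 0x0a000002, 12345, 80 };
            printf("connection mapped to group %u\n", conn_group(&t));
            process_packet(&t);
            return 0;
        }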