7 research outputs found

    Why Does Flow Director Cause Packet Reordering?

    Full text link
    Intel Ethernet Flow Director is an advanced network interface card (NIC) technology. It provides the benefits of parallel receive processing in multiprocessing environments and can automatically steer incoming network data to the same core on which its application process resides. However, our analysis and experiments show that Flow Director cannot guarantee in-order packet delivery in multiprocessing environments. Packet reordering causes various negative impacts. E.g., TCP performs poorly with severe packet reordering. In this paper, we use a simplified model to analyze why Flow Director can cause packet reordering. Our experiments verify our analysis

    IP packet capture on high throughput networks by using NUMA architectures

    Get PDF
    Capture packets from IP networks is a commonly used technique in many IT fields for monitoring and analysis the IP traffic flows exchanged over computer networks. The new infrastructures for high throughput networks, however, have made this important technique more and more complex to carry out (also in small or medium sized local networks) even if using the newest multi-core systems developed today. This paper therefore shows some limits of a recent packet capture system, implemented by using a NUMA architecture, and suggests a possible solution that could be adopted to obtain a full functional IP packet capture system on high throughput networks

    IP packet capture on high throughput networks by using NUMA architectures

    Get PDF
    Capture packets from IP networks is a commonly used technique in many IT fields for monitoring and analysis the IP traffic flows exchanged over computer networks. The new infrastructures for high throughput networks, however, have made this important technique more and more complex to carry out (also in small or medium sized local networks) even if using the newest multi-core systems developed today. This paper therefore shows some limits of a recent packet capture system, implemented by using a NUMA architecture, and suggests a possible solution that could be adopted to obtain a full functional IP packet capture system on high throughput networks

    Architectural Breakdown of End-to-End Latency in a TCP/IP Network

    Full text link

    High Performance Web Servers: A Study In Concurrent Programming Models

    Get PDF
    With the advent of commodity large-scale multi-core computers, the performance of software running on these computers has become a challenge to researchers and enterprise developers. While academic research and industrial products have moved in the direction of writing scalable and highly available services using distributed computing, single machine performance remains an active domain, one which is far from saturated. This thesis selects an archetypal software example and workload in this domain, and describes software characteristics affecting performance. The example is highly-parallel web-servers processing a static workload. Particularly, this work examines concurrent programming models in the context of high-performance web-servers across different architectures — threaded (Apache, Go and μKnot), event-driven (Nginx, μServer) and staged (WatPipe) — compared with two static workloads in two different domains. The two workloads are a Zipf distribution of file sizes representing a user session pulling an assortment of many small and a few large files, and a 50KB file representing chunked streaming of a large audio or video file. Significant effort is made to fairly compare eight web-servers by carefully tuning each via their adjustment parameters. Tuning plays a significant role in workload-specific performance. The two domains are no disk I/O (in-memory file set) and medium disk I/O. The domains are created by lowering the amount of RAM available to the web-server from 4GB to 2GB, forcing files to be evicted from the file-system cache. Both domains are also restricted to 4 CPUs. The primary goal of this thesis is to examine fundamental performance differences between threaded and event-driven concurrency models, with particular emphasis on user-level threading models. Additionally, a secondary goal of the work is to examine high-performance software under restricted hardware environments. Over-provisioned hardware environments can mask architectural and implementation shortcomings in software – the hypothesis in this work is that restricting resources stresses the application, bringing out important performance characteristics and properties. Experimental results for the given workload show that memory pressure is one of the most significant factors for the degradation of web-server performance, because it forces both the onset and amount of disk I/O. With an ever increasing need to support more content at faster rates, a web-server relies heavily on in-memory caching of files and related content. In fact, personal and small business web-servers are even run on minimal hardware, like the Raspberry Pi, with only 1GB of RAM and a small SD card for the file system. Therefore, understanding behaviour and performance in restricted contexts should be a normal aspect of testing a web server (and other software systems)

    Performance Comparison of Uniprocessor and Multiprocessor Web Server Architectures

    Get PDF
    This thesis examines web-server architectures for static workloads on both uniprocessor and multiprocessor systems to determine the key factors affecting their performance. The architectures examined are event-driven (userver) and pipeline (WatPipe). As well, a thread-per-connection (Knot) architecture is examined for the uniprocessor system. Various workloads are tested to determine their effect on the performance of the servers. Significant effort is made to ensure a fair comparison among the servers. For example, all the servers are implemented in C or C++, and support sendfile and edge-triggered epoll. The existing servers, Knot and userver, are extended as necessary, and the new pipeline-server, WatPipe, is implemented using userver as its initial code base. Each web server is also tuned to determine its best configuration for a specific workload, which is shown to be critical to achieve best server performance. Finally, the server experiments are verified to ensure each is performing within reasonable standards. The performance of the various architectures is examined on a uniprocessor system. Three workloads are examined: no disk-I/O, moderate disk-I/O and heavy disk-I/O. These three workloads highlight the differences among the architectures. As expected, the experiments show the amount of disk I/O is the most significant factor in determining throughput, and once there is memory pressure, the memory footprint of the server is the crucial performance factor. The peak throughput differs by only 9-13% among the best servers of each architecture across the various workloads. Furthermore, the appropriate configuration parameters for best performance varied based on workload, and no single server performed the best for all workloads. The results show the event-driven and pipeline servers have equivalent throughput when there is moderate or no disk-I/O. The only difference is during the heavy disk-I/O experiments where WatPipe's smaller memory footprint for its blocking server gave it a performance advantage. The Knot server has 9% lower throughput for no disk-I/O and moderate disk-I/O and 13% lower for heavy disk-I/O, showing the extra overheads incurred by thread-per-connection servers, but still having performance close to the other server architectures. An unexpected result is that blocking sockets with sendfile outperforms non-blocking sockets with sendfile when there is heavy disk-I/O because of more efficient disk access. Next, the performance of the various architectures is examined on a multiprocessor system. Knot is excluded from the experiments as its underlying thread library, Capriccio, only supports uniprocessor execution. For these experiments, it is shown that partitioning the system so that server processes, subnets and requests are handled by the same CPU is necessary to achieve high throughput. Both N-copy and new hybrid versions of the uniprocessor servers, extended to support partitioning, are tested. While the N-copy servers performed the best, new hybrid versions of the servers also performed well. These hybrid servers have throughput within 2% of the N-copy servers but offer benefits over N-copy such as a smaller memory footprint and a shared address-space. For multiprocessor systems, it is shown that once the system becomes disk bound, the throughput of the servers is drastically reduced. To maximize performance on a multiprocessor, high disk throughput and lots of memory are essential
    corecore