1,722 research outputs found
On the Efficacy of Fine-Grained Traffic Splitting Protocols in Data Center Networks
Multi-rooted tree topologies are commonly used to construct high-bandwidth data center network fabrics. In these networks, switches typically rely on equal-cost multipath (ECMP) routing techniques to split traffic across multiple paths, such that packets within a flow traverse the same end-to-end path. Unfortunately, since ECMP splits traffic at flow granularity, it can cause load imbalance across paths, resulting in poor utilization of network resources. More fine-grained traffic splitting techniques are typically not preferred because they can cause packet reordering that can, according to conventional wisdom, lead to severe TCP throughput degradation. In this work, we revisit this fact in the context of regular data center topologies such as fat-tree architectures. We argue that packet-level traffic splitting, where packets of a flow are sprayed through all available paths, would lead to a better load-balanced network, which in turn leads to significantly more balanced queues and much higher throughput compared to ECMP.
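The contrast between flow-level and packet-level splitting can be illustrated with a small sketch. Everything here is an assumption made for illustration rather than the paper's experimental setup: the workload of 64 flows, the CRC32 flow hash, the four paths, and the round-robin spraying order. The sketch only counts packets per path; it does not model TCP or reordering.

```python
import random
import zlib


def ecmp_loads(flows, num_paths):
    """Flow-level splitting: all packets of a flow follow the path picked by a hash of the flow id."""
    loads = [0] * num_paths
    for flow_id, packets in flows:
        path = zlib.crc32(str(flow_id).encode()) % num_paths
        loads[path] += packets
    return loads


def spray_loads(flows, num_paths):
    """Packet-level splitting: packets are sprayed round-robin over all paths."""
    loads = [0] * num_paths
    next_path = 0
    for _flow_id, packets in flows:
        for _ in range(packets):
            loads[next_path] += 1
            next_path = (next_path + 1) % num_paths
    return loads


def imbalance(loads):
    """Ratio of the most loaded path to the mean load (1.0 means perfectly balanced)."""
    return max(loads) / (sum(loads) / len(loads))


rng = random.Random(0)
# Hypothetical workload: mostly small flows with an occasional large one.
flows = [(i, rng.choice([10, 10, 10, 1000])) for i in range(64)]
ecmp = ecmp_loads(flows, num_paths=4)
spray = spray_loads(flows, num_paths=4)
print("ECMP  imbalance:", round(imbalance(ecmp), 3))
print("Spray imbalance:", round(imbalance(spray), 3))
```

With round-robin spraying the per-path packet counts differ by at most one, whereas the flow-hash result depends on where the large flows happen to land.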
Performance Modelling and Optimisation of Multi-hop Networks
A major challenge in the design of large-scale networks is to predict and optimise the
total time and energy consumption required to deliver a packet from a source node to a
destination node. Examples of such complex networks include wireless ad hoc and sensor
networks which need to deal with the effects of node mobility, routing inaccuracies, higher
packet loss rates, limited or time-varying effective bandwidth, energy constraints, and the
computational limitations of the nodes. They also include more reliable communication
environments, such as wired networks, that are susceptible to random failures, security
threats and malicious behaviours which compromise their quality of service (QoS) guarantees.
In such networks, packets traverse a number of hops that cannot be determined
in advance and encounter non-homogeneous network conditions that have been largely
ignored in the literature. This thesis examines analytical properties of packet travel in
large networks and investigates the implications of some packet coding techniques on both
QoS and resource utilisation.
Specifically, we use a mixed jump and diffusion model to represent packet traversal
through large networks. The model accounts for network non-homogeneity regarding
routing and the loss rate that a packet experiences as it passes successive segments of a
source to destination route. A mixed analytical-numerical method is developed to compute
the average packet travel time and the energy it consumes. The model is able to capture
the effects of increased loss rate in areas remote from the source and destination, variable
rate of advancement towards destination over the route, as well as of defending against
malicious packets within a certain distance from the destination. We then consider sending
multiple coded packets that follow independent paths to the destination node so as to
mitigate the effects of losses and routing inaccuracies. We study a homogeneous medium
and obtain the time-dependent properties of the packet’s travel process, allowing us to
compare the merits and limitations of coding, both in terms of delivery times and energy
efficiency. Finally, we propose models that can assist in the analysis and optimisation
of the performance of inter-flow network coding (NC). We analyse two queueing models
for a router that carries out NC, in addition to its standard packet routing function. The
approach is extended to the study of multiple hops, which leads to an optimisation problem
that characterises the optimal time that packets should be held back in a router, waiting
for coding opportunities to arise, so that the total packet end-to-end delay is minimised.
Architectural Enhancements for Data Transport in Datacenter Systems
Datacenter systems run myriad applications, which frequently communicate with each other and/or Input/Output (I/O) devices—including network adapters, storage devices, and accelerators. Due to the growing speed of I/O devices and the emergence of microservice-based programming models, the I/O software stacks have become a critical factor in end-to-end communication performance. As such, I/O software stacks have been evolving rapidly in recent years. Datacenters rely on fast, efficient “Software Data Planes”, which orchestrate data transfer between applications and I/O devices. The goal of this dissertation is to enhance the performance, efficiency, and scalability of software data planes by diagnosing their existing issues and addressing them through hardware-software solutions.
In the first step, I characterize challenges of modern software data planes, which bypass the operating system kernel to avoid associated overheads. Since traditional interrupts and system calls cannot be delivered to user code without kernel assistance, kernel-bypass data planes use spinning cores on I/O queues to identify work/data arrival. Spin-polling obviously wastes CPU cycles on checking empty queues; however, I show that it entails even more drawbacks: (1) Full-tilt spinning cores perform more (useless) polling work when there is less work pending in the queues. (2) Spin-polling scales poorly with the number of polled queues due to processor cache capacity constraints, especially when traffic is unbalanced. (3) Spin-polling also scales poorly with the number of cores due to the overhead of polling and operation rate limits. (4) Whereas shared queues can mitigate load imbalance and head-of-line blocking, synchronization overheads of spinning on them limit their potential benefits.
Next, I propose a notification accelerator, dubbed HyperPlane, which replaces spin-polling in software data planes. The design principles of HyperPlane are: (1) not iterating over empty I/O queues to find work/data in ready ones, (2) blocking/halting when all queues are empty rather than spinning fruitlessly, and (3) allowing multiple cores to efficiently monitor a shared set of queues. These principles lead to queue scalability and work proportionality, and realize the theoretical merits of shared queues. HyperPlane is realized with a programming model front-end and a hardware microarchitecture back-end. Evaluation of HyperPlane shows its significant advantage in terms of throughput, average/tail latency, and energy efficiency over a state-of-the-art spin-polling-based software data plane, with very small power and area overheads.
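The difference between scanning every queue and being told which queues are ready can be sketched in software. This is a toy model under stated assumptions, not HyperPlane's programming model or microarchitecture: the class names `PolledQueues` and `NotifiedQueues`, the `empty_checks` counter, and the 64-queue setup are all hypothetical.

```python
from collections import deque


class PolledQueues:
    """Spin-polling model: the consumer scans every queue, empty or not."""

    def __init__(self, num_queues):
        self.queues = [deque() for _ in range(num_queues)]
        self.empty_checks = 0  # wasted checks on empty queues

    def push(self, qid, item):
        self.queues[qid].append(item)

    def drain(self):
        """One full scan; takes at most one item from each non-empty queue."""
        found = []
        for q in self.queues:
            if q:
                found.append(q.popleft())
            else:
                self.empty_checks += 1
        return found


class NotifiedQueues:
    """Notification model: producers record which queues became ready."""

    def __init__(self, num_queues):
        self.queues = [deque() for _ in range(num_queues)]
        self.ready = deque()  # ids of queues with pending work
        self.empty_checks = 0  # stays 0: empty queues are never inspected

    def push(self, qid, item):
        if not self.queues[qid]:
            self.ready.append(qid)  # queue goes from empty to ready
        self.queues[qid].append(item)

    def drain(self):
        """Serve one item per currently ready queue."""
        found = []
        for _ in range(len(self.ready)):
            qid = self.ready.popleft()
            found.append(self.queues[qid].popleft())
            if self.queues[qid]:
                self.ready.append(qid)  # still has work, stays ready
        return found


polled, notified = PolledQueues(64), NotifiedQueues(64)
for qid in (3, 17, 17, 42):
    polled.push(qid, f"pkt@{qid}")
    notified.push(qid, f"pkt@{qid}")

first_polled = polled.drain()
first_notified = notified.drain()
print(first_polled, polled.empty_checks)
print(first_notified, notified.empty_checks)
```

In this model the polling consumer inspects 61 empty queues to find three items, while the notified consumer touches only the ready ones. When `ready` is empty, a real consumer could block or halt instead of spinning.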
Finally, I focus on the data transfer aspect in software data planes. Cache misses incurred by accessing I/O data are a major bottleneck in software data planes. Despite considerable efforts put into delivering I/O data directly to the last-level cache, some access latency is still exposed. Cores cannot prefetch such data to nearer caches in today's systems because of the complex access pattern of data buffers and the lack of an appropriate notification mechanism that can trigger the prefetch operations. As such, I propose HyperData, a data transfer accelerator based on targeted prefetching. HyperData prefetches exact (rather than predicted) data buffers (or a required subset to avoid cache pollution) to the L1 cache of the consumer core at the right time. Prefetching can be done for both core-peripheral and core-core communications. HyperData's prefetcher is programmable and supports various queue formats, namely direct (regular), indirect (Virtio), and multi-consumer queues. I show that with a minor overhead, HyperData effectively hides data access latency in software data planes, thereby improving both application- and system-level performance and efficiency.
PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/169826/1/hosseing_1.pd
Performance Analysis of Packet Processing Systems (Pakettiprosessointijärjestelmien Suorituskykyanalyysi)
This thesis investigates the use of measurement, simulation, and modeling methods for the performance analysis of packet processing systems, and more precisely hardware-accelerated multiprocessor system-on-chip (MPSoC) devices running task-parallel applications. To guarantee the tight latency and throughput requirements, the devices often incorporate complex hardware-accelerated packet scheduling mechanisms. At the same time, due to the complexity of these systems, different software abstractions, such as task-based programming models, are used to develop packet processing applications. These challenges, together with the dynamic characteristics of the packet streams, make the performance analysis of packet processing systems non-trivial.
We demonstrate that, with extended queue disciplines and support for modeling parallelism, resource network methodology is a viable approach for modeling complex MPSoC-based systems running task-based parallel applications on dynamic workloads. The main contributions of our work are three-fold. First, we have extended the toolset of an existing in-house modeling and simulation software, Performance Simulation Environment. The extensions enable modeling of user-definable queue disciplines, which further enable flexible modeling of complex hardware interactions of MPSoCs and the parallelism of task-based programming models. Second, we have studied, instrumented, and measured the characteristics of a packet processing system. Finally, we have modeled a multi-blade packet processing system with customizable workload and task-parallel application models, and run simulation experiments.
In both experiments, the model behaves as expected. According to the experiment results, the resource network concept seems to be a viable tool for the performance analysis of complex packet processing systems. The chosen abstraction level provides the desired balance between functionality, ease of use, and simulation performance.
Optimising data centre operation by removing the transport bottleneck
Data centres lie at the heart of almost every service on the Internet. Data centres are used to provide search results, to power social media, to store and index email, to host “cloud” applications, for online retail and to provide a myriad of other web services. Consequently, the more efficient they can be made, the better for all of us. The power of modern data centres is in combining commodity off-the-shelf server hardware and network equipment to provide what Google’s Barroso and Hölzle describe as “warehouse scale” computers.
Data centres rely on TCP, a transport protocol that was originally designed for use in the Internet. Like other such protocols, TCP has been optimised to maximise throughput, usually by filling up queues at the bottleneck. However, for most applications within a data centre network latency is more critical than throughput. Consequently the choice of transport protocol becomes a bottleneck for performance. My thesis is that the solution to this is to move away from the use of one-size-fits-all transport protocols towards ones that have been designed to reduce latency across the data centre and which can dynamically respond to the needs of the applications.
This dissertation focuses on optimising the transport layer in data centre networks. In particular I address the question of whether any single transport mechanism can be flexible enough to cater to the needs of all data centre traffic. I show that one leading protocol (DCTCP) has been heavily optimised for certain network conditions. I then explore approaches that seek to minimise latency for applications that care about it while still allowing throughput-intensive applications to receive a good level of service. My key contributions to this are Silo and Trevi.
Trevi is a novel transport system for storage traffic that utilises fountain coding to maximise throughput and minimise latency while being agnostic to drop, thus allowing storage traffic to be pushed out of the way when latency-sensitive traffic is present in the network. Silo is an admission control system that is designed to give tenants of a multi-tenant data centre guaranteed low-latency network performance. Both of these were developed in collaboration with others.
Scalable and Adaptive Load Balancing on IBM PowerNP
Web and other Internet-based server farms are a critical company resource. A solution to the increased complexity of server farms and to the need to improve server performance in terms of scalability, fault tolerance and management is to implement a load balancing technique. It consists of a front-end machine which intelligently redirects the traffic to several Real Servers. We discuss the feasibility of implementing adaptive load balancing with minimal flow disruption on the IBM PowerNP Network Processor. We focus our attention on the steady-state part of the algorithm and propose a PowerNP-tailored mapping algorithm derived from Robust Hash Mapping. We propose and demonstrate a fast algorithmic solution (despite the simple arithmetic logic of the PowerNP), as well as a scalable approach (aiming at minimizing the packet processing time), and, finally, we present some initial performance results.
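The abstract does not spell out the mapping algorithm, so the following sketch uses highest-random-weight (rendezvous) hashing as a stand-in to illustrate what "minimal flow disruption" means for a hash-based flow-to-server mapping. Whether it matches the paper's PowerNP-tailored algorithm is an assumption; the server names, the flow identifiers, and the SHA-256 weight function are hypothetical.

```python
import hashlib


def weight(flow_id, server):
    """Pseudo-random weight for a (flow, server) pair, derived from a hash."""
    digest = hashlib.sha256(f"{flow_id}|{server}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def pick_server(flow_id, servers):
    """Map a flow to the server with the highest weight for that flow."""
    return max(servers, key=lambda s: weight(flow_id, s))


servers = ["rs1", "rs2", "rs3", "rs4"]  # hypothetical Real Servers
flows = [f"flow-{i}" for i in range(1000)]

before = {f: pick_server(f, servers) for f in flows}
remaining = [s for s in servers if s != "rs3"]  # rs3 fails or is drained
after = {f: pick_server(f, remaining) for f in flows}

moved = [f for f in flows if before[f] != after[f]]
print(f"{len(moved)} of {len(flows)} flows were remapped")
```

Only flows that were mapped to the removed server change destination; every other flow keeps its server, because the relative order of the surviving weights is unchanged.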
Reducing Internet Latency: A Survey of Techniques and their Merit
Bob Briscoe, Anna Brunstrom, Andreas Petlund, David Hayes, David Ros, Ing-Jyh Tsang, Stein Gjessing, Gorry Fairhurst, Carsten Griwodz, Michael Welzl
Peer reviewed. Preprint.
Understanding and Improving the Performance of Read Operations Across the Storage Stack
We live in a data-driven era in which large amounts of data are generated and collected every day. Storage systems are the backbone of this era, as they store and retrieve data. To cope with increasing data demands (e.g., diversity, scalability), storage systems are experiencing changes across the stack. Like other computer systems, storage systems rely on layering and modularity to allow rapid development. Unfortunately, this can hinder performance clarity and introduce degradations (e.g., tail latency) due to unexpected interactions between components of the stack. In this thesis, we first perform a study to understand the behavior across different layers of the storage stack. We focus on sequential read workloads, a common I/O pattern in distributed file systems (e.g., HDFS, GFS). We analyze the interaction between read workloads, local file systems (i.e., ext4), and storage media (i.e., SSDs). We perform the same experiment over different periods of time (e.g., file lifetime). We uncover 3 slowdowns, all of which occur in the lower layers. When combined, these slowdowns can degrade throughput by 30%. We find that increased parallelism on the local file system mitigates these slowdowns, showing the need for adaptability in storage stacks. Given that performance instabilities can occur at any layer of the stack, it is important that upper-layer systems are able to react. We propose smart hedging, a novel technique to manage high-percentile (tail) latency variations in read operations. Smart hedging considers production challenges, such as massive scalability, heterogeneity, and ease of deployment and maintainability. Our technique establishes a dynamic threshold by tracking latencies on the client side. If a read operation exceeds the threshold, a new hedged request is issued, in an exponential back-off manner. We implement our technique in HDFS and evaluate it on 70k servers in 3 datacenters. Our technique reduces average tail latency without generating excessive system load.
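The hedging idea described above (a client-side dynamic threshold plus exponentially backed-off hedges) might be sketched as follows. This is a minimal illustration and not the thesis's HDFS implementation: the `HedgingPolicy` class, the sliding window of 100 samples, the 95th-percentile threshold, the cap of three hedges, and the doubling of the wait between hedges are all assumptions.

```python
from collections import deque


class HedgingPolicy:
    """Tracks recent read latencies and decides when to issue hedged requests."""

    def __init__(self, window=100, percentile=95, max_hedges=3):
        self.samples = deque(maxlen=window)  # sliding window of latencies (ms)
        self.percentile = percentile         # integer percentile, assumed 95
        self.max_hedges = max_hedges         # cap to limit extra load

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def threshold(self):
        """Dynamic threshold: the chosen percentile of the recent window."""
        if not self.samples:
            return None
        ordered = sorted(self.samples)
        index = min(len(ordered) - 1, self.percentile * len(ordered) // 100)
        return ordered[index]

    def hedge_schedule(self):
        """Elapsed times (ms) at which hedges are issued; the wait doubles each time."""
        base = self.threshold()
        if base is None:
            return []
        schedule, elapsed, wait = [], 0.0, base
        for _ in range(self.max_hedges):
            elapsed += wait
            schedule.append(elapsed)
            wait *= 2  # exponential back-off between successive hedges
        return schedule

    def hedges_for(self, elapsed_ms):
        """Number of hedges issued for a read still pending after elapsed_ms."""
        return sum(1 for t in self.hedge_schedule() if elapsed_ms >= t)


policy = HedgingPolicy()
for latency in range(1, 101):  # synthetic latencies of 1..100 ms
    policy.record(float(latency))

print(policy.threshold())
print(policy.hedge_schedule())
print(policy.hedges_for(300.0))
```

Capping the number of hedges and doubling the wait between them is one way to keep the extra load bounded when many reads are slow at once.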