Fastpass: A Centralized “Zero-Queue” Datacenter Network
An ideal datacenter network should provide several properties, including low median and tail latency, high utilization (throughput), fair allocation of network resources between users or applications, deadline-aware scheduling, and congestion (loss) avoidance. Current datacenter networks inherit the principles that went into the design of the Internet, where packet transmission and path selection decisions are distributed among the endpoints and routers. Instead, we propose that each sender should delegate to a centralized arbiter the control of when each packet should be transmitted and what path it should follow. This paper describes Fastpass, a datacenter network architecture built using this principle. Fastpass incorporates two fast algorithms: the first determines the time at which each packet should be transmitted, while the second determines the path to use for that packet. In addition, Fastpass uses an efficient protocol between the endpoints and the arbiter and an arbiter replication strategy for fault-tolerant failover. We deployed and evaluated Fastpass in a portion of Facebook’s datacenter network. Our results show that Fastpass achieves high throughput comparable to current networks at a 240× reduction in queue lengths (4.35 Mbytes reducing to 18 Kbytes), achieves much fairer and more consistent flow throughputs than the baseline TCP (5200× reduction in the standard deviation of per-flow throughput with five concurrent connections), scales from 1 to 8 cores in the arbiter implementation with the ability to schedule 2.21 Terabits/s of traffic in software on eight cores, and achieves a 2.5× reduction in the number of TCP retransmissions in a latency-sensitive service at Facebook.
National Science Foundation (U.S.) (grant IIS-1065219); Irwin Mark Jacobs and Joan Klein Jacobs Presidential Fellowship; Hertz Foundation (Fellowship
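The abstract does not spell out how the arbiter picks transmission times. As a rough illustration only, the sketch below assumes a simple greedy matching per timeslot, in which each endpoint sends at most one packet and receives at most one packet per slot. The function name `allocate_timeslots` and the `(src, dst, packets)` demand format are hypothetical and not taken from the paper.

```python
from collections import deque


def allocate_timeslots(demands, num_slots):
    """Greedily assign (src, dst) pairs to timeslots.

    demands: iterable of (src, dst, packets) tuples (hypothetical format).
    Returns a list with one entry per timeslot; each entry lists the
    (src, dst) pairs admitted in that slot. Within a slot, no endpoint
    appears twice as a source or twice as a destination.
    """
    pending = deque([src, dst, n] for src, dst, n in demands if n > 0)
    schedule = []
    for _ in range(num_slots):
        busy_src, busy_dst = set(), set()
        admitted = []
        # Visit every pending demand once, in arrival order.
        for _ in range(len(pending)):
            src, dst, n = pending.popleft()
            if src not in busy_src and dst not in busy_dst:
                busy_src.add(src)
                busy_dst.add(dst)
                admitted.append((src, dst))
                n -= 1
            if n > 0:
                # Demand not fully served yet: keep it for a later slot.
                pending.append([src, dst, n])
        schedule.append(admitted)
    return schedule


demo = allocate_timeslots([("A", "B", 1), ("C", "B", 1), ("C", "D", 1)], 2)
print(demo)  # [[('A', 'B'), ('C', 'D')], [('C', 'B')]]
```

In this toy run, the two packets destined for B are placed in different timeslots, so they never contend for the same receiver link at the same time.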
Source-Routed Multicast Schemes for Large-Scale Cloud Data Center Networks
Data centers (DCs) have been witnessing unprecedented growth in size, number and complexity in recent years. They consist of tens of thousands of servers interconnected by fast network switches, hosting and enabling numerous applications with various traffic characteristics and requirements. As a result, DC networks have been presented with several unique challenges pertaining to the scaling and allocation of network resources during the forwarding and moving of data across the different DC servers. Traffic routing in general and multicast routing in particular are important functions in DC networks, especially since modern cloud DCs tend to exhibit one-to-many communication traffic patterns. Unfortunately, recent multicast routing approaches that adopt IP multicast suffer from load balancing issues and do not scale well with the number of supported multicast groups when used for cloud DC networks. In this thesis, we propose a set of new, complementary schemes that overcome these challenges. Firstly, we study existing DC network topologies and propose the Circulant Fat-Tree topology, an improvement over the traditional Fat-Tree topology with properties that better suit today's DC networks. Then, we review and classify recent studies that investigate and measure the traffic behavior of operational DC networks. We focus on the way they collect the traffic as well as on the key findings made in these studies.
Secondly, we propose Bert, a source-initiated multicast routing scheme for DCs. Bert scales well with both the number and the size of multicast groups, and does so through clustering: it divides the members of a multicast group into a set of clusters, with each cluster employing its own forwarding rules. In essence, Bert yields much less multicast traffic overhead than state-of-the-art schemes.
Thirdly, we propose Ernie, a scalable and load-balanced multicast source routing scheme. Ernie introduces a novel method for scaling out the number of supported multicast groups. In particular, it constructs and organizes multicast header information inside packets in a manner that allows core/root switches to forward down only the needed information. Ernie also introduces an effective technique for balancing multicast traffic load across downstream links. Specifically, it assigns multicast groups to core switches so as to ensure an even load distribution across the downstream links
A Survey on Data Plane Programming with P4: Fundamentals, Advances, and Applied Research
With traditional networking, users can configure control plane protocols to
match the specific network configuration, but without the ability to
fundamentally change the underlying algorithms. With SDN, the users may provide
their own control plane, which can control network devices through their data
plane APIs. Programmable data planes allow users to define their own data plane
algorithms for network devices including appropriate data plane APIs which may
be leveraged by user-defined SDN control. Thus, programmable data planes and
SDN offer great flexibility for network customization, be it for specialized,
commercial appliances, e.g., in 5G or data center networks, or for rapid
prototyping in industrial and academic research. Programming
protocol-independent packet processors (P4) has emerged as the currently most
widespread abstraction, programming language, and concept for data plane
programming. It is developed and standardized by an open community and it is
supported by various software and hardware platforms. In this paper, we survey
the literature from 2015 to 2020 on data plane programming with P4. Our survey
covers 497 references of which 367 are scientific publications. We organize our
work into two parts. In the first part, we give an overview of data plane
programming models, the programming language, architectures, compilers,
targets, and data plane APIs. We also consider research efforts to advance P4
technology. In the second part, we analyze a large body of literature
considering P4-based applied research. We categorize 241 research papers into
different application domains, summarize their contributions, and extract
prototypes, target platforms, and source code availability.
Comment: Submitted to IEEE Communications Surveys and Tutorials (COMS) on
2021-01-2
RDNA: Residue-Defined Architecture for Data Center Networks
Recently, we have observed the growing use of information and communication technologies. Institutions and users simply need high-quality connectivity for their data, with the expectation of instant access at any time and from anywhere. An essential element in ensuring the quality of cloud connectivity is the architecture of the communication network in the data center (DCN, Data Center Network). This is because a significant part of Internet traffic is based on data communication and processing that take place inside the data center (DC) infrastructure. However, the routing protocols and the forwarding and management methods currently in use prove insufficient to meet today's demands for cloud connectivity. This is mainly due to the dependence on lookup operations in forwarding tables, which increases end-to-end latency. Moreover, traditional recovery mechanisms use additional state in the tables, which increases the complexity of management routines and drastically reduces the scalability of route protection. Another difficulty is multicast communication inside the DC: existing solutions are complex to implement and do not support group configuration at the rates currently required.
In this context, this thesis explores the residue number system centered on the Chinese Remainder Theorem (CRT) as the foundation for designing a new routing system for DCNs. More specifically, we introduce the RDNA architecture, which advances the state of the art by simplifying the forwarding model in the core, based on a residue operation (the remainder of a division). The route is defined as the residue between a route identifier and local identifiers (prime numbers) assigned to the core switches. The edge switches receive entries that configure the flows according to the network policy defined by the controller. Each flow is mapped at the edge to a primary route identifier and an emergency one. These residue operations allow packets to be forwarded through the corresponding output port. In failure situations, the emergency route identifier enables fast recovery by sending packets through an alternative output port. RDNA is scalable assuming a 2-tier Clos network topology, which is widely used in DCNs. To compare RDNA with other works in the literature, we analyze scalability in terms of the number of bits required for unicast and multicast communication. In the analysis, we varied the number of nodes in the network, the node degree, and the number of physical hosts for each topology. For unicast communication, RDNA reduced the header size by a factor of 4.5 compared with the COXCast proposal. For multicast communication, a linear programming model was designed to minimize a polynomial function. RDNA reduced the header size by up to 50% for the same number of members per group.
As a proof of concept, two prototypes were implemented, one in the Mininet emulated environment and the other on the NetFPGA SUME platform. The results showed that RDNA achieves deterministic packet forwarding latency, a switching time of 600 nanoseconds per core element, ultra-fast failure recovery on the order of microseconds, and no latency variation (jitter) in the network core.
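The residue-based forwarding described in this abstract can be illustrated with a small Chinese Remainder Theorem computation. This is a minimal sketch under stated assumptions, not the thesis implementation: the function names `route_id` and `output_port`, the switch identifiers 5, 7 and 11, and the port numbers are all made up for the example. Python 3.8 or later is assumed for `math.prod` and the modular inverse via `pow`.

```python
from math import prod


def route_id(switch_ids, ports):
    """Return R with R % switch_ids[i] == ports[i] for every i (CRT).

    switch_ids must be pairwise coprime (e.g. distinct primes), and each
    port number must be smaller than its switch identifier.
    """
    modulus = prod(switch_ids)
    total = 0
    for s, p in zip(switch_ids, ports):
        if not 0 <= p < s:
            raise ValueError(f"port {p} does not fit under switch id {s}")
        partial = modulus // s
        # pow(x, -1, s) is the modular inverse of x modulo s.
        total += p * partial * pow(partial, -1, s)
    return total % modulus


def output_port(route, switch_id):
    """What a core switch computes: a single remainder, no table lookup."""
    return route % switch_id


core_switches = [5, 7, 11]  # hypothetical prime identifiers
primary = route_id(core_switches, [2, 3, 1])
print(primary)  # 122
print([output_port(primary, s) for s in core_switches])  # [2, 3, 1]
```

An emergency route identifier would be computed the same way from a different list of output ports, so a switch that detects a failure can apply the same remainder operation to the alternative identifier.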
Resilient and Scalable Forwarding for Software-Defined Networks with P4-Programmable Switches
Traditional networking devices support only fixed features and limited configurability.
Network softwarization leverages programmable software and hardware platforms to remove those limitations.
In this context, the concept of programmable data planes allows the packet processing pipeline of networking devices to be programmed directly and custom control plane algorithms to be created.
This flexibility enables the design of novel networking mechanisms where the status quo struggles to meet high demands of next-generation networks like 5G, Internet of Things, cloud computing, and industry 4.0.
P4 is the most popular technology to implement programmable data planes.
However, programmable data planes, and in particular, the P4 technology, emerged only recently.
Thus, P4 support for some well-established networking concepts is still lacking and several issues remain unsolved due to the different characteristics of programmable data planes in comparison to traditional networking.
The research of this thesis focuses on two open issues of programmable data planes.
First, it develops resilient and efficient forwarding mechanisms for the P4 data plane, as there are no satisfactory state-of-the-art best practices yet.
Second, it enables BIER in high-performance P4 data planes.
BIER is a novel, scalable, and efficient transport mechanism for IP multicast traffic that so far has only very limited support on high-performance forwarding platforms.
The main results of this thesis are published as eight peer-reviewed publications and one post-publication peer-reviewed publication. The results cover the development of suitable resilience mechanisms for P4 data planes, the development and implementation of resilient BIER forwarding in P4, and extensive evaluations of all developed and implemented mechanisms. Furthermore, the results contain a comprehensive P4 literature study.
Two more peer-reviewed papers contain additional content that is not directly related to the main results.
They implement congestion avoidance mechanisms in P4 and develop a scheduling concept to find cost-optimized load schedules based on day-ahead forecasts
Doctor of Philosophy
In the past few years, we have seen a tremendous increase in digital data being generated. By 2011, storage vendors had shipped 905 PB of purpose-built backup appliances. By 2013, the number of objects stored in Amazon S3 had reached 2 trillion. Facebook had stored 20 PB of photos by 2010. All of these require an efficient storage solution. To improve space efficiency, compression and deduplication are widely used. Compression works by identifying repeated strings and replacing them with more compact encodings, while deduplication partitions data into fixed-size or variable-size chunks and removes duplicate blocks. While we have seen great improvements in space efficiency from these two approaches, there are still some limitations. First, traditional compressors are limited in their ability to detect redundancy across a large range, since they search for redundant data at a fine-grained level (string level). For deduplication, metadata embedded in an input file changes more frequently, and this introduces more unnecessary unique chunks, leading to poor deduplication. Cloud storage systems suffer from unpredictable and inefficient performance because of interference among different types of workloads. This dissertation proposes techniques to improve the effectiveness of traditional compressors and deduplication in improving space efficiency, and a new IO scheduling algorithm to improve performance predictability and efficiency for cloud storage systems. The common idea is to utilize similarity. To improve the effectiveness of compression and deduplication, similarity in content is used to transform an input file into a compression- or deduplication-friendly format. We propose Migratory Compression, a generic data transformation that identifies similar data at a coarse-grained level (block level) and then groups similar blocks together. It can be used as a preprocessing stage for any traditional compressor.
We find that metadata has a large impact in reducing the benefit of deduplication. To isolate the impact of metadata, we propose to separate metadata from data. Three approaches are presented for use cases with different constraints. For the commonly used tar format, we propose Migratory Tar: a data transformation and also a new tar format that deduplicates better. We also present a case study where we use deduplication to reduce storage consumption for storing disk images, while at the same time achieving high performance in image deployment. Finally, we apply the same principle of utilizing similarity in IO scheduling to prevent interference between random and sequential workloads, leading to efficient, consistent, and predictable performance for sequential workloads and a high disk utilization
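The fixed-size variant of deduplication mentioned in this abstract (partition data into chunks, keep one copy of each distinct chunk) can be sketched as follows. This is a generic illustration and not the dissertation's system; the function names `dedup_fixed` and `restore`, the use of SHA-256 as the chunk fingerprint, and the tiny 4-byte chunk size are assumptions chosen for readability.

```python
import hashlib


def dedup_fixed(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks and store each distinct chunk once.

    Returns (store, recipe): store maps a chunk fingerprint to the chunk
    bytes, and recipe is the ordered list of fingerprints needed to
    rebuild the original data.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)  # keep only the first copy
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe):
    """Rebuild the original bytes from the chunk store and the recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


data = b"ABCDABCDXYZWABCD"
store, recipe = dedup_fixed(data)
print(len(recipe), len(store))  # 4 2
print(restore(store, recipe) == data)  # True
```

With fixed-size chunks, inserting a single byte near the start of the input shifts every later chunk boundary, which is one reason small, frequently changing regions such as embedded metadata can produce many new unique chunks.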