Sparse Allreduce: Efficient Scalable Communication for Power-Law Data
Many large datasets exhibit power-law statistics: the web graph, social
networks, text data, click-through data, etc. Their adjacency graphs are termed
natural graphs, and are known to be difficult to partition. As a consequence
most distributed algorithms on these graphs are communication intensive. Many
algorithms on natural graphs involve an Allreduce: a sum or average of
partitioned data which is then shared back to the cluster nodes. Examples
include PageRank, spectral partitioning, and many machine learning algorithms
including regression, factor (topic) models, and clustering. In this paper we
describe an efficient and scalable Allreduce primitive for power-law data. We
point out scaling problems with existing butterfly and round-robin networks for
Sparse Allreduce, and show that a hybrid approach improves on both.
Furthermore, we show that Sparse Allreduce stages should be nested instead of
cascaded (as in the dense case), and that the optimum-throughput Allreduce
network should be a butterfly of heterogeneous degree where degree decreases
with depth into the network. Finally, a simple replication scheme is introduced
to deal with node failures. We present experiments showing significant
improvements over existing systems such as PowerGraph and Hadoop.
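To make the Allreduce operation described in this abstract concrete, the following is a minimal single-process sketch, assuming each simulated node holds a sparse vector as an `{index: value}` dictionary. It uses a plain degree-2 butterfly (recursive doubling) in which every node ends up with the full sum. It is not the nested, heterogeneous-degree network the paper proposes, and the names `merge_sparse` and `butterfly_allreduce` are hypothetical.

```python
def merge_sparse(a, b):
    """Sum two sparse vectors stored as {index: value} dicts."""
    out = dict(a)
    for idx, val in b.items():
        out[idx] = out.get(idx, 0.0) + val
    return out


def butterfly_allreduce(node_data):
    """Simulate a degree-2 butterfly allreduce over sparse vectors.

    node_data: list of {index: value} dicts, one per simulated node.
    The node count must be a power of two (an assumption of this sketch).
    Returns a list with the reduced vector held by each node.
    """
    n = len(node_data)
    if n == 0 or n & (n - 1):
        raise ValueError("node count must be a power of two")
    state = [dict(d) for d in node_data]
    step = 1
    while step < n:
        # Node i exchanges data with partner (i XOR step); both keep the sum.
        state = [merge_sparse(state[i], state[i ^ step]) for i in range(n)]
        step *= 2
    return state


if __name__ == "__main__":
    nodes = [{0: 1.0, 5: 2.0}, {5: 3.0}, {7: 1.0}, {0: 4.0, 7: 2.0}]
    for rank, vec in enumerate(butterfly_allreduce(nodes)):
        print(rank, sorted(vec.items()))
```

With n nodes this takes log2(n) exchange rounds. On power-law data the merged vectors grow with each round, which is one way to see why the network layout matters for sparse inputs.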
Datacenter Traffic Control: Understanding Techniques and Trade-offs
Datacenters provide cost-effective and flexible access to scalable compute
and storage resources necessary for today's cloud computing needs. A typical
datacenter is made up of thousands of servers connected with a large network
and usually managed by one operator. To provide quality access to the variety
of applications and services hosted on datacenters and maximize performance, it
is necessary to use datacenter networks effectively and efficiently.
Datacenter traffic is often a mix of several classes with different priorities
and requirements. This includes user-generated interactive traffic, traffic
with deadlines, and long-running traffic. To this end, custom transport
protocols and traffic management techniques have been developed to improve
datacenter network performance.
In this tutorial paper, we review the general architecture of datacenter
networks, various topologies proposed for them, their traffic properties,
general traffic control challenges in datacenters and general traffic control
objectives. The purpose of this paper is to bring out the important
characteristics of traffic control in datacenters and not to survey all
existing solutions (as it is virtually impossible due to the massive body of
existing research). We hope to provide readers with a wide range of options and
factors while considering a variety of traffic control mechanisms. We discuss
various characteristics of datacenter traffic control including management
schemes, transmission control, traffic shaping, prioritization, load balancing,
multipathing, and traffic scheduling. Next, we point to several open challenges
as well as new and interesting networking paradigms. At the end of this paper,
we briefly review inter-datacenter networks that connect geographically
dispersed datacenters, which have been receiving increasing attention recently
and pose interesting and novel research problems.
Comment: Accepted for publication in IEEE Communications Surveys and Tutorials
RAS enhancements for RDMA communications
Ethernet as the communication medium in the enterprise data center has outlived all competing media and stood the test of time with regard to speed and cost. The future is also poised for growth, with 40 and 100 Gbps speeds just over the horizon. The current state of the technology is being enhanced and extended with lossless features to allow for fabric convergence of Storage and Inter Process Communication (IPC) networks. It is over this medium that an increase is observed in the adoption of Remote Direct Memory Access (RDMA) over Ethernet, using offloaded TCP/IP (iWARP) and InfiniBand over Ethernet (RoCE) communication stacks on RDMA-capable NIC adapters (RNICs).
RDMA enables direct application-to-application communication over the network, resulting in numerous and significant benefits such as reduced CPU utilization, lower-latency communications, increased energy efficiency, and reduced overall system requirements. However, with these benefits also comes increased software complexity in how RDMA interface users communicate. The RDMA communication semantics, which originate from the HPC domain, are heavily biased towards low-latency and high-bandwidth communications rather than Reliability, Availability, and Serviceability (RAS). As adoption increases and enterprise data centers begin to leverage RDMA over Ethernet, enhancements to the OS stack software architecture and to the design of the components involved are required to address these deficiencies. Operating system interfaces, device drivers, adapter hardware design, and embedded firmware features must be viewed from a high-availability and maintainability point of view.
RAS enhancements for RDMA communications proposes the software architectural tradeoffs for enhancing the iWARP and RoCE RDMA implementations for communications in the enterprise data center, with new and traditional RAS features for existing communications stacks and devices. The architecture leverages software enhancements in traceability, availability, maintainability, serviceability, fault isolation, and resource management, so that in the event of errors there is a higher probability that the forensic data needed to identify the root cause is immediately and automatically available.
Electrical and Computer Engineerin
RDMA mechanisms for columnar data in analytical environments
Integrated master's dissertation in Informatics Engineering
The amount of data in information systems is growing constantly and, as a consequence, the
complexity of analytical processing is greater. There are several storage solutions to persist
this information, with different architectures targeting different use cases. For analytical
processing, storage solutions with a column-oriented format are particularly relevant due
to the convenient placement of the data in persistent storage and the closer mapping to
in-memory processing.
Access to the database is typically remote and has associated overhead, mainly when
it is necessary to obtain the same data multiple times. Thus, it is desirable to have a cache
on the processing side, and there are solutions for this. The problem with the existing
solutions is the overhead introduced by network latency and memory copies between logical
layers. Remote Direct Memory Access (RDMA) mechanisms have the potential to help
minimize this overhead. Furthermore, this type of mechanism is suited to large amounts of
data because zero-copy has more impact as the data volume increases. One of the problems
associated with RDMA mechanisms is the complexity of development. This complexity is
induced by its different development paradigm when compared to other network
communication protocols, for example, TCP.
Aiming to improve the efficiency of analytical processing, this dissertation presents a
distributed cache that takes advantage of RDMA mechanisms to improve analytical processing
performance. The cache abstracts the intricacies of RDMA mechanisms and is developed
as a middleware making it transparent to take advantage of this technology. Moreover, this
technique could be used in other contexts where a distributed cache makes sense, such as
a set of replicated web servers that access the same database.
Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs
As communication protocols evolve, datacenter network utilization increases.
As a result, congestion is more frequent, causing higher latency and packet
loss. Combined with the increasing complexity of workloads, manual design of
congestion control (CC) algorithms becomes extremely difficult. This calls for
the development of AI approaches to replace the human effort. Unfortunately, it
is currently not possible to deploy AI models on network devices due to their
limited computational capabilities. Here, we offer a solution to this problem
by building a computationally-light solution based on a recent reinforcement
learning CC algorithm [arXiv:2207.02295]. We reduce the inference time of RL-CC
by a factor of 500 by distilling its complex neural network into decision trees. This
transformation enables real-time inference within the μ-sec decision-time
requirement, with a negligible effect on quality. We deploy the transformed
policy on NVIDIA NICs in a live cluster. Compared to popular CC algorithms used
in production, RL-CC is the only method that performs well on all benchmarks
tested over a wide range of flow counts. It balances multiple metrics
simultaneously: bandwidth, latency, and packet drops. These results suggest
that data-driven methods for CC are feasible, challenging the prior belief that
handcrafted heuristics are necessary to achieve optimal performance.
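The distillation step mentioned in this abstract can be illustrated with a toy sketch: sample a "teacher" policy, then fit a small regression tree to its outputs so that inference becomes a handful of comparisons. Everything here is an assumption made for illustration. `teacher_policy` is a hypothetical one-dimensional stand-in for a neural network, and `fit_tree` is a generic greedy squared-error tree, not the authors' procedure or feature set.

```python
import math


def teacher_policy(x):
    """Hypothetical stand-in for an expensive neural-network policy."""
    return math.tanh(3.0 * x)


def fit_tree(xs, ys, depth):
    """Fit a 1-D regression tree by greedy squared-error splits.

    Returns a float (leaf value) or a tuple (threshold, left, right).
    """
    mean = sum(ys) / len(ys)
    if depth == 0 or len(xs) < 2:
        return mean
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    xs = [xs[i] for i in order]
    ys = [ys[i] for i in order]
    total = sum(ys)
    total_sq = sum(y * y for y in ys)
    best = None
    left_sum = left_sq = 0.0
    for k in range(1, len(xs)):
        left_sum += ys[k - 1]
        left_sq += ys[k - 1] ** 2
        if xs[k] == xs[k - 1]:
            continue  # cannot split between identical inputs
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        # Sum of squared errors of both sides around their own means.
        sse = (left_sq - left_sum ** 2 / k) + (
            right_sq - right_sum ** 2 / (len(xs) - k))
        if best is None or sse < best[0]:
            best = (sse, k)
    if best is None:
        return mean
    k = best[1]
    threshold = (xs[k - 1] + xs[k]) / 2
    return (threshold,
            fit_tree(xs[:k], ys[:k], depth - 1),
            fit_tree(xs[k:], ys[k:], depth - 1))


def predict(tree, x):
    """Walk the tree: one comparison per level, no arithmetic on weights."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x < threshold else right
    return tree


if __name__ == "__main__":
    samples = [i / 100 - 1 for i in range(201)]
    labels = [teacher_policy(x) for x in samples]
    student = fit_tree(samples, labels, depth=6)
    worst = max(abs(predict(student, x) - teacher_policy(x)) for x in samples)
    print("max abs error on training samples:", worst)
```

The student needs at most `depth` comparisons per prediction, which is the kind of property that makes tree policies attractive on devices with limited compute.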