Effects of component-subscription network topology on large-scale data centre performance scaling
Modern large-scale data centres, such as those used for cloud computing
service provision, are becoming ever-larger as the operators of those data
centres seek to maximise the benefits from economies of scale. With these
increases in size comes a growth in system complexity, which is usually
problematic. There is an increased desire for automated "self-star"
configuration, management, and failure-recovery of the data-centre
infrastructure, but many traditional techniques scale much worse than linearly
as the number of nodes to be managed increases. As the number of nodes in a
median-sized data-centre looks set to increase by two or three orders of
magnitude in coming decades, it seems reasonable to attempt to explore and
understand the scaling properties of the data-centre middleware before such
data-centres are constructed. In [1] we presented SPECI, a simulator that
predicts aspects of large-scale data-centre middleware performance,
concentrating on the influence of status changes such as policy updates or
routine node failures. [...]. In [1] we used a first-approximation assumption
that such subscriptions are distributed wholly at random across the data
centre. In this present paper, we explore the effects of introducing more
realistic constraints to the structure of the internal network of
subscriptions. We contrast the original results [...] exploring the effects of
making the data-centre's subscription network have a regular lattice-like
structure, and also semi-random network structures resulting from parameterised
network generation functions that create "small-world" and "scale-free"
networks. We show that for distributed middleware topologies, the structure and
distribution of tasks carried out in the data centre can significantly
influence the performance overhead imposed by the middleware.
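The subscription topologies contrasted above can be sketched with small generator functions. The following is an illustrative Python sketch, not SPECI's actual implementation; the node count and subscription degree are arbitrary, and the generators follow the standard Watts-Strogatz and Barabasi-Albert constructions that the "small-world" and "scale-free" labels refer to:

```python
import random

def ring_lattice(n, k):
    """Regular lattice: each node subscribes to its k nearest ring
    neighbours (k even), mirroring the lattice-like structure above."""
    return {(i, (i + j) % n) for i in range(n) for j in range(1, k // 2 + 1)}

def small_world(n, k, p, seed=0):
    """Watts-Strogatz-style rewiring: each lattice edge is redirected to
    a random target with probability p, giving a "small-world" network."""
    rng = random.Random(seed)
    edges = set()
    for u, v in sorted(ring_lattice(n, k)):
        if rng.random() < p:
            w = rng.randrange(n)
            while w == u or (u, w) in edges or (w, u) in edges:
                w = rng.randrange(n)
            edges.add((u, w))
        else:
            edges.add((u, v))
    return edges

def scale_free(n, m, seed=0):
    """Barabasi-Albert-style preferential attachment: each new node
    subscribes to m existing nodes with probability proportional to
    their degree, producing a few heavily subscribed hubs."""
    rng = random.Random(seed)
    targets, repeated, edges = list(range(m)), [], set()
    for v in range(m, n):
        edges.update((u, v) for u in set(targets))
        repeated += targets + [v] * m
        targets = [rng.choice(repeated) for _ in range(m)]
    return edges

for name, net in [("lattice", ring_lattice(1000, 8)),
                  ("small-world", small_world(1000, 8, 0.1)),
                  ("scale-free", scale_free(1000, 4))]:
    print(f"{name}: {len(net)} subscription edges")
```

With p = 0 the rewiring model degenerates to the regular lattice, and with p = 1 it approaches the wholly random subscriptions assumed in [1], so a single parameter spans the cases compared in the paper.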
RAMP: A Flat Nanosecond Optical Network and MPI Operations for Distributed Deep Learning Systems
Distributed deep learning (DDL) systems strongly depend on network
performance. Current electronic packet switched (EPS) network architectures and
technologies suffer from variable-diameter topologies, low bisection bandwidth
and over-subscription, which affect the completion time of communication and
collective operations.
We introduce a near-exascale, full-bisection bandwidth, all-to-all,
single-hop, all-optical network architecture with nanosecond reconfiguration
called RAMP, which supports large-scale distributed and parallel computing
systems (12.8~Tbps per node for up to 65,536 nodes).
For the first time, a custom RAMP-x MPI strategy and a network transcoder are
proposed to run MPI collective operations across the optical circuit switched
(OCS) network in a schedule-less and contention-less manner. RAMP achieves a
7.6–171× speed-up in completion time across all MPI operations compared
to realistic EPS and OCS counterparts. It can also deliver a 1.3–16× and
7.8–58× reduction in Megatron and DLRM training time respectively, while
offering a 42–53× and 3.3–12.4× improvement in energy consumption
and cost respectively.
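The completion-time advantage of a flat, full-bisection network can be illustrated with a back-of-envelope model. The sketch below is a hypothetical lower-bound calculation, not RAMP's scheduling model; the 12.8 Tb/s per-node figure comes from the abstract, while the node count, message size and 3:1 EPS over-subscription ratio are assumptions for illustration:

```python
def all_to_all_time(num_nodes, msg_bytes, per_node_bps, oversubscription=1.0):
    """Lower bound on all-to-all completion time: every node sends
    msg_bytes to each of the other num_nodes - 1 nodes, limited only by
    its (possibly over-subscribed) injection bandwidth."""
    effective_bps = per_node_bps / oversubscription
    total_bits = 8 * msg_bytes * (num_nodes - 1)
    return total_bits / effective_bps

# 12.8 Tb/s per node is from the abstract; 1024 nodes, 1 MiB messages
# and a 3:1 EPS over-subscription are illustrative assumptions.
flat = all_to_all_time(1024, 1 << 20, 12.8e12)
eps = all_to_all_time(1024, 1 << 20, 12.8e12, oversubscription=3.0)
print(f"flat optical: {flat * 1e6:.0f} us, "
      f"3:1 over-subscribed EPS: {eps * 1e6:.0f} us")
```

Under this simple model the over-subscription factor translates directly into a proportional increase in collective completion time, which is the effect the flat single-hop design removes.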
DeMMon: Decentralized Management and Monitoring Framework
The centralized model proposed by the Cloud computing paradigm mismatches the decentralized
nature of mobile and IoT applications, given the fact that most of the data
production and consumption is performed by end-user devices outside of the Data Center
(DC). As the number of these devices grows, and given the need to transport data to and
from DCs for computation, application providers incur additional infrastructure costs,
and end-users incur delays when performing operations.
These reasons have led us into a post-cloud era, where a new computing paradigm
arose: Edge Computing. Edge Computing takes into account the broad spectrum of
devices residing outside of the DC, closer to the clients, as potential targets for computations,
potentially reducing infrastructure costs, improving the quality of service (QoS)
for end-users and allowing new interaction paradigms between users and applications.
Managing and monitoring the execution of these devices raises new challenges previously
unaddressed by Cloud computing, given the scale of these systems and the devices'
(potentially) unreliable data connections and heterogeneous computational power. The
study of the state-of-the-art has revealed that existing resource monitoring and management
solutions require manual configuration and have centralized components, which
we believe do not scale for larger-scale systems.
In this work, we address these limitations by presenting a novel Decentralized Management
and Monitoring (DeMMon) system, targeted at edge settings. DeMMon provides
primitives to ease the development of tools that manage computational resources
that support edge-enabled applications, decomposed in components, through decentralized
actions, taking advantage of partial knowledge of the system. Our solution was
evaluated to assess its benefits in information dissemination and monitoring
capabilities across a set of realistic emulated scenarios of up to 750 nodes with variable
failure rates. The results show the validity of our approach and that it can outperform
state-of-the-art solutions regarding scalability and reliability.
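Decentralized dissemination of the kind DeMMon relies on is commonly built on gossip. The following is a minimal push-gossip simulation at the 750-node scale used in the evaluation; it is an illustration of the general technique, not DeMMon's actual protocol, and uniform random peer sampling stands in for sampling from a partial view of the system:

```python
import random

def gossip_rounds(n, fanout, seed=0):
    """Push-gossip dissemination: each round, every informed node
    forwards the update to `fanout` peers drawn uniformly at random.
    Returns the number of rounds until all n nodes are informed."""
    rng = random.Random(seed)
    informed = {0}          # node 0 originates the monitoring update
    rounds = 0
    while len(informed) < n:
        rounds += 1
        fresh = set()
        for _ in informed:
            fresh.update(rng.sample(range(n), fanout))
        informed |= fresh
    return rounds

print(gossip_rounds(750, 4), "rounds to reach all 750 nodes")
```

The round count grows roughly logarithmically with system size, which is why gossip-based designs are attractive when centralized monitoring components stop scaling.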
Optical Technologies and Control Methods for Scalable Data Centre Networks
Attributing to the increasing adoption of cloud services, video services and associated machine learning applications, the traffic demand inside data centers is increasing exponentially, which necessitates an innovated networking infrastructure with high scalability and cost-efficiency. As a promising candidate to provide high capacity, low latency, cost-effective and scalable interconnections, optical technologies have been introduced to data center networks (DCNs) for approximately a decade. To further improve the DCN performance to meet the increasing traffic demand by using photonic technologies, two current trends are a)increasing the bandwidth density of the transmission links and b) maximizing IT and network resources utilization through disaggregated topologies and architectures. Therefore, this PhD thesis focuses on introducing and applying advanced and efficient technologies in these two fields to DCNs to improve their performance. On the one hand, at the link level, since the traditional single-mode fiber (SMF) solutions based on wavelength division multiplexing (WDM) over C+L band may fall short in satisfying the capacity, front panel density, power consumption, and cost requirements of high-performance DCNs, a space division multiplexing (SDM) based DCN using homogeneous multi-core fibers (MCFs) is proposed.With the exploited bi-directional model and proposed spectrum allocation algorithms, the proposed DCN shows great benefits over the SMF solution in terms of network capacity and spatial efficiency. In the meanwhile, it is found that the inter-core crosstalk (IC-XT) between the adjacent cores inside the MCF is dynamic rather than static, therefore, the behaviour of the IC-XT is experimentally investigated under different transmission conditions. On the other hand, an optically disaggregated DCN is developed and to ensure the performance of it, different architectures, topologies, resource routing and allocation algorithms are proposed and compared. 
Compared to the traditional server-based DCN, the resource utilization, scalability and the cost-efficiency are significantly improved
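The interplay between spectrum allocation and inter-core crosstalk can be illustrated with a toy first-fit allocator. The sketch below is a hypothetical simplification rather than the thesis's actual algorithms: it assumes a 7-core MCF layout (one centre core ringed by six outer cores) and treats "no overlapping slot on an adjacent core" as a crude proxy for keeping IC-XT low:

```python
def first_fit_slot(core_usage, adjacency, core, demand_slots):
    """Return the first start index of a contiguous run of free slots on
    `core` such that no slot in the run is also occupied on any adjacent
    core (a simple proxy for avoiding inter-core crosstalk, IC-XT)."""
    total_slots = len(core_usage[core])
    for start in range(total_slots - demand_slots + 1):
        run = range(start, start + demand_slots)
        if any(core_usage[core][s] for s in run):
            continue  # slot already taken on this core
        if any(core_usage[nbr][s] for nbr in adjacency[core] for s in run):
            continue  # would overlap an adjacent core's allocation
        return start
    return None  # demand blocked on this core

# Assumed 7-core MCF: core 0 in the centre, cores 1-6 on a ring.
adjacency = {0: [1, 2, 3, 4, 5, 6],
             1: [0, 2, 6], 2: [0, 1, 3], 3: [0, 2, 4],
             4: [0, 3, 5], 5: [0, 4, 6], 6: [0, 5, 1]}
usage = {c: [False] * 16 for c in adjacency}  # 16 spectrum slots per core

start = first_fit_slot(usage, adjacency, core=1, demand_slots=4)
if start is not None:
    for s in range(start, start + 4):
        usage[1][s] = True
print("core 1 allocation starts at slot", start)
```

Because the constraint only binds between adjacent cores, a later demand on a non-adjacent core can still reuse the same spectrum slots, which is the spatial-reuse benefit the SDM design exploits.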
- …