Effects of component-subscription network topology on large-scale data centre performance scaling
Modern large-scale data centres, such as those used for cloud computing
service provision, are becoming ever-larger as the operators of those data
centres seek to maximise the benefits from economies of scale. With these
increases in size comes a growth in system complexity, which is usually
problematic. There is an increased desire for automated "self-star"
configuration, management, and failure-recovery of the data-centre
infrastructure, but many traditional techniques scale much worse than linearly
as the number of nodes to be managed increases. As the number of nodes in a
median-sized data-centre looks set to increase by two or three orders of
magnitude in coming decades, it seems reasonable to attempt to explore and
understand the scaling properties of the data-centre middleware before such
data-centres are constructed. In [1] we presented SPECI, a simulator that
predicts aspects of large-scale data-centre middleware performance,
concentrating on the influence of status changes such as policy updates or
routine node failures. [...]. In [1] we used a first-approximation assumption
that such subscriptions are distributed wholly at random across the data
centre. In this present paper, we explore the effects of introducing more
realistic constraints to the structure of the internal network of
subscriptions. We contrast the original results [...] exploring the effects of
making the data-centre's subscription network have a regular lattice-like
structure, and also semi-random network structures resulting from parameterised
network generation functions that create "small-world" and "scale-free"
networks. We show that for distributed middleware topologies, the structure and
distribution of tasks carried out in the data centre can significantly
influence the performance overhead imposed by the middleware.
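The subscription topologies contrasted above can be sketched with small generator functions. The following is an illustrative Python sketch, not SPECI's actual implementation; the node count and subscription degree are arbitrary, and the generators follow the standard Watts-Strogatz and Barabasi-Albert constructions that the "small-world" and "scale-free" labels refer to:

```python
import random

def ring_lattice(n, k):
    """Regular lattice: each node subscribes to its k nearest ring
    neighbours (k even), mirroring the lattice-like structure above."""
    return {(i, (i + j) % n) for i in range(n) for j in range(1, k // 2 + 1)}

def small_world(n, k, p, seed=0):
    """Watts-Strogatz-style rewiring: each lattice edge is redirected to
    a random target with probability p, giving a "small-world" network."""
    rng = random.Random(seed)
    edges = set()
    for u, v in sorted(ring_lattice(n, k)):
        if rng.random() < p:
            w = rng.randrange(n)
            while w == u or (u, w) in edges or (w, u) in edges:
                w = rng.randrange(n)
            edges.add((u, w))
        else:
            edges.add((u, v))
    return edges

def scale_free(n, m, seed=0):
    """Barabasi-Albert-style preferential attachment: each new node
    subscribes to m existing nodes with probability proportional to
    their degree, producing a few heavily subscribed hubs."""
    rng = random.Random(seed)
    targets, repeated, edges = list(range(m)), [], set()
    for v in range(m, n):
        edges.update((u, v) for u in set(targets))
        repeated += targets + [v] * m
        targets = [rng.choice(repeated) for _ in range(m)]
    return edges

for name, net in [("lattice", ring_lattice(1000, 8)),
                  ("small-world", small_world(1000, 8, 0.1)),
                  ("scale-free", scale_free(1000, 4))]:
    print(f"{name}: {len(net)} subscription edges")
```

With p = 0 the rewiring model degenerates to the regular lattice, and with p = 1 it approaches the wholly random subscriptions assumed in [1], so a single parameter spans the cases compared in the paper.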
RAMP: A Flat Nanosecond Optical Network and MPI Operations for Distributed Deep Learning Systems
Distributed deep learning (DDL) systems strongly depend on network
performance. Current electronic packet switched (EPS) network architectures and
technologies suffer from variable-diameter topologies, low bisection bandwidth
and over-subscription, which affect the completion time of communication and
collective operations.
We introduce a near-exascale, full-bisection bandwidth, all-to-all,
single-hop, all-optical network architecture with nanosecond reconfiguration
called RAMP, which supports large-scale distributed and parallel computing
systems (12.8~Tbps per node for up to 65,536 nodes).
For the first time, a custom RAMP-x MPI strategy and a network transcoder are
proposed to run MPI collective operations across the optical circuit switched
(OCS) network in a schedule-less and contention-less manner. RAMP achieves a
7.6–171× speed-up in completion time across all MPI operations compared
to realistic EPS and OCS counterparts. It can also deliver a 1.3–16× and
7.8–58× reduction in Megatron and DLRM training time respectively, while
offering a 42–53× and 3.3–12.4× improvement in energy consumption
and cost respectively.
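The completion-time advantage of a flat, full-bisection network can be illustrated with a back-of-envelope model. The sketch below is a hypothetical lower-bound calculation, not RAMP's scheduling model; the 12.8 Tb/s per-node figure comes from the abstract, while the node count, message size and 3:1 EPS over-subscription ratio are assumptions for illustration:

```python
def all_to_all_time(num_nodes, msg_bytes, per_node_bps, oversubscription=1.0):
    """Lower bound on all-to-all completion time: every node sends
    msg_bytes to each of the other num_nodes - 1 nodes, limited only by
    its (possibly over-subscribed) injection bandwidth."""
    effective_bps = per_node_bps / oversubscription
    total_bits = 8 * msg_bytes * (num_nodes - 1)
    return total_bits / effective_bps

# 12.8 Tb/s per node is from the abstract; 1024 nodes, 1 MiB messages
# and a 3:1 EPS over-subscription are illustrative assumptions.
flat = all_to_all_time(1024, 1 << 20, 12.8e12)
eps = all_to_all_time(1024, 1 << 20, 12.8e12, oversubscription=3.0)
print(f"flat optical: {flat * 1e6:.0f} us, "
      f"3:1 over-subscribed EPS: {eps * 1e6:.0f} us")
```

Under this simple model the over-subscription factor translates directly into a proportional increase in collective completion time, which is the effect the flat single-hop design removes.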
DeMMon: Decentralized Management and Monitoring Framework
The centralized model proposed by the Cloud computing paradigm mismatches the decentralized
nature of mobile and IoT applications, given the fact that most of the data
production and consumption is performed by end-user devices outside of the Data Center
(DC). As the number of these devices grows, and given the need to transport data to and
from DCs for computation, application providers incur additional infrastructure costs,
and end-users incur delays when performing operations.
These reasons have led us into a post-cloud era, where a new computing paradigm
arose: Edge Computing. Edge Computing takes into account the broad spectrum of
devices residing outside of the DC, closer to the clients, as potential targets for computations,
potentially reducing infrastructure costs, improving the quality of service (QoS)
for end-users and allowing new interaction paradigms between users and applications.
Managing and monitoring the execution of these devices raises new challenges previously
unaddressed by Cloud computing, given the scale of these systems and the devices'
(potentially) unreliable data connections and heterogeneous computational power. The
study of the state-of-the-art has revealed that existing resource monitoring and management
solutions require manual configuration and have centralized components, which
we believe do not scale for larger-scale systems.
In this work, we address these limitations by presenting a novel Decentralized Management
and Monitoring (DeMMon) system, targeted at edge settings. DeMMon provides
primitives to ease the development of tools that manage computational resources
that support edge-enabled applications, decomposed in components, through decentralized
actions, taking advantage of partial knowledge of the system. Our solution was
evaluated to assess its benefits in information dissemination and monitoring
capabilities across a set of realistic emulated scenarios of up to 750 nodes with variable
failure rates. The results show the validity of our approach and that it can outperform
state-of-the-art solutions regarding scalability and reliability.
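Decentralized dissemination of the kind DeMMon relies on is commonly built on gossip. The following is a minimal push-gossip simulation at the 750-node scale used in the evaluation; it is an illustration of the general technique, not DeMMon's actual protocol, and uniform random peer sampling stands in for sampling from a partial view of the system:

```python
import random

def gossip_rounds(n, fanout, seed=0):
    """Push-gossip dissemination: each round, every informed node
    forwards the update to `fanout` peers drawn uniformly at random.
    Returns the number of rounds until all n nodes are informed."""
    rng = random.Random(seed)
    informed = {0}          # node 0 originates the monitoring update
    rounds = 0
    while len(informed) < n:
        rounds += 1
        fresh = set()
        for _ in informed:
            fresh.update(rng.sample(range(n), fanout))
        informed |= fresh
    return rounds

print(gossip_rounds(750, 4), "rounds to reach all 750 nodes")
```

The round count grows roughly logarithmically with system size, which is why gossip-based designs are attractive when centralized monitoring components stop scaling.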
Optical Technologies and Control Methods for Scalable Data Centre Networks
Attributing to the increasing adoption of cloud services, video services and associated machine learning applications, the traffic demand inside data centers is increasing exponentially, which necessitates an innovated networking infrastructure with high scalability and cost-efficiency. As a promising candidate to provide high capacity, low latency, cost-effective and scalable interconnections, optical technologies have been introduced to data center networks (DCNs) for approximately a decade. To further improve the DCN performance to meet the increasing traffic demand by using photonic technologies, two current trends are a)increasing the bandwidth density of the transmission links and b) maximizing IT and network resources utilization through disaggregated topologies and architectures. Therefore, this PhD thesis focuses on introducing and applying advanced and efficient technologies in these two fields to DCNs to improve their performance. On the one hand, at the link level, since the traditional single-mode fiber (SMF) solutions based on wavelength division multiplexing (WDM) over C+L band may fall short in satisfying the capacity, front panel density, power consumption, and cost requirements of high-performance DCNs, a space division multiplexing (SDM) based DCN using homogeneous multi-core fibers (MCFs) is proposed.With the exploited bi-directional model and proposed spectrum allocation algorithms, the proposed DCN shows great benefits over the SMF solution in terms of network capacity and spatial efficiency. In the meanwhile, it is found that the inter-core crosstalk (IC-XT) between the adjacent cores inside the MCF is dynamic rather than static, therefore, the behaviour of the IC-XT is experimentally investigated under different transmission conditions. On the other hand, an optically disaggregated DCN is developed and to ensure the performance of it, different architectures, topologies, resource routing and allocation algorithms are proposed and compared. 
Compared to the traditional server-based DCN, the resource utilization, scalability and the cost-efficiency are significantly improved
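The interplay between spectrum allocation and inter-core crosstalk can be illustrated with a toy first-fit allocator. The sketch below is a hypothetical simplification rather than the thesis's actual algorithms: it assumes a 7-core MCF layout (one centre core ringed by six outer cores) and treats "no overlapping slot on an adjacent core" as a crude proxy for keeping IC-XT low:

```python
def first_fit_slot(core_usage, adjacency, core, demand_slots):
    """Return the first start index of a contiguous run of free slots on
    `core` such that no slot in the run is also occupied on any adjacent
    core (a simple proxy for avoiding inter-core crosstalk, IC-XT)."""
    total_slots = len(core_usage[core])
    for start in range(total_slots - demand_slots + 1):
        run = range(start, start + demand_slots)
        if any(core_usage[core][s] for s in run):
            continue  # slot already taken on this core
        if any(core_usage[nbr][s] for nbr in adjacency[core] for s in run):
            continue  # would overlap an adjacent core's allocation
        return start
    return None  # demand blocked on this core

# Assumed 7-core MCF: core 0 in the centre, cores 1-6 on a ring.
adjacency = {0: [1, 2, 3, 4, 5, 6],
             1: [0, 2, 6], 2: [0, 1, 3], 3: [0, 2, 4],
             4: [0, 3, 5], 5: [0, 4, 6], 6: [0, 5, 1]}
usage = {c: [False] * 16 for c in adjacency}  # 16 spectrum slots per core

start = first_fit_slot(usage, adjacency, core=1, demand_slots=4)
if start is not None:
    for s in range(start, start + 4):
        usage[1][s] = True
print("core 1 allocation starts at slot", start)
```

Because the constraint only binds between adjacent cores, a later demand on a non-adjacent core can still reuse the same spectrum slots, which is the spatial-reuse benefit the SDM design exploits.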
- …