
    Effects of component-subscription network topology on large-scale data centre performance scaling

    Modern large-scale data centres, such as those used for cloud computing service provision, are becoming ever larger as their operators seek to maximise the benefits from economies of scale. With these increases in size comes a growth in system complexity, which is usually problematic. There is an increased desire for automated "self-star" configuration, management, and failure-recovery of the data-centre infrastructure, but many traditional techniques scale much worse than linearly as the number of nodes to be managed increases. As the number of nodes in a median-sized data-centre looks set to increase by two or three orders of magnitude in coming decades, it seems reasonable to attempt to explore and understand the scaling properties of the data-centre middleware before such data-centres are constructed. In [1] we presented SPECI, a simulator that predicts aspects of large-scale data-centre middleware performance, concentrating on the influence of status changes such as policy updates or routine node failures. [...]. In [1] we used a first-approximation assumption that such subscriptions are distributed wholly at random across the data centre. In the present paper, we explore the effects of introducing more realistic constraints on the structure of the internal network of subscriptions. We contrast the original results [...] exploring the effects of making the data-centre's subscription network have a regular lattice-like structure, and also semi-random network structures resulting from parameterised network generation functions that create "small-world" and "scale-free" networks. We show that for distributed middleware topologies, the structure and distribution of tasks carried out in the data centre can significantly influence the performance overhead imposed by the middleware.
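
    As a hedged illustration only (not SPECI itself), the sketch below uses networkx to generate the four classes of component-subscription topology compared above: wholly random, regular lattice-like, small-world and scale-free. The node count, mean subscription degree and rewiring probability are illustrative placeholders rather than the paper's parameters; the printed degree spread and path length are simple structural proxies for how unevenly status-update load would fall across the data centre.

    # Sketch: generating candidate subscription-network topologies (illustrative parameters).
    import networkx as nx

    N = 1000          # number of middleware components (illustrative)
    K = 8             # mean subscriptions per component (illustrative)

    topologies = {
        # Baseline assumption from [1]: subscriptions placed uniformly at random.
        "random":      nx.gnm_random_graph(N, N * K // 2, seed=1),
        # Regular lattice-like structure: each node subscribes to its K ring neighbours.
        "lattice":     nx.watts_strogatz_graph(N, K, p=0.0, seed=1),
        # Small-world: the lattice with a fraction of subscriptions rewired at random.
        "small_world": nx.watts_strogatz_graph(N, K, p=0.1, seed=1),
        # Scale-free: preferential attachment gives a heavy-tailed subscription count.
        "scale_free":  nx.barabasi_albert_graph(N, K // 2, seed=1),
    }

    for name, g in topologies.items():
        degrees = [d for _, d in g.degree()]
        # Measure path length on the largest component in case rewiring disconnects the graph.
        comp = g.subgraph(max(nx.connected_components(g), key=len))
        print(f"{name:12s} max_degree={max(degrees):4d} "
              f"avg_path={nx.average_shortest_path_length(comp):.2f}")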

    RAMP: A Flat Nanosecond Optical Network and MPI Operations for Distributed Deep Learning Systems

    Distributed deep learning (DDL) systems strongly depend on network performance. Current electronic packet switched (EPS) network architectures and technologies suffer from variable diameter topologies, low bisection bandwidth and over-subscription, affecting the completion time of communication and collective operations. We introduce a near-exascale, full-bisection bandwidth, all-to-all, single-hop, all-optical network architecture with nanosecond reconfiguration called RAMP, which supports large-scale distributed and parallel computing systems (12.8 Tbps per node for up to 65,536 nodes). For the first time, a custom RAMP-x MPI strategy and a network transcoder are proposed to run MPI collective operations across the optical circuit switched (OCS) network in a schedule-less and contention-less manner. RAMP achieves a 7.6-171× speed-up in completion time across all MPI operations compared to realistic EPS and OCS counterparts. It can also deliver a 1.3-16× and 7.8-58× reduction in Megatron and DLRM training time respectively, while offering a 42-53× and 3.3-12.4× improvement in energy consumption and cost respectively.
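
    To make the collective-operation terminology concrete, here is a minimal sketch, entirely independent of the RAMP hardware and RAMP-x scheduling, that times a standard MPI all-reduce with mpi4py: the kind of bandwidth-bound collective whose completion time the paper's strategy targets. The buffer size and rank count are illustrative placeholders.

    # Sketch: timing a plain MPI all-reduce (run with e.g. `mpiexec -n 4 python allreduce.py`).
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Each rank contributes a gradient-sized buffer, as in data-parallel DDL training.
    send = np.full(1 << 20, float(rank), dtype=np.float32)
    recv = np.empty_like(send)

    comm.Barrier()                     # align ranks before timing
    t0 = MPI.Wtime()
    comm.Allreduce(send, recv, op=MPI.SUM)
    t1 = MPI.Wtime()

    if rank == 0:
        print(f"Allreduce of {send.nbytes / 1e6:.1f} MB over {comm.Get_size()} ranks "
              f"took {(t1 - t0) * 1e3:.2f} ms")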

    DeMMon Decentralized Management and Monitoring Framework

    The centralized model proposed by the Cloud computing paradigm mismatches the decentralized nature of mobile and IoT applications, given that most data production and consumption is performed by end-user devices outside of the Data Center (DC). As the number of these devices grows, and given the need to transport data to and from DCs for computation, application providers incur additional infrastructure costs, and end-users incur delays when performing operations. These reasons have led us into a post-cloud era, where a new computing paradigm arose: Edge Computing. Edge Computing takes into account the broad spectrum of devices residing outside of the DC, closer to the clients, as potential targets for computations, potentially reducing infrastructure costs, improving the quality of service (QoS) for end-users and allowing new interaction paradigms between users and applications. Managing and monitoring the execution of these devices raises new challenges previously unaddressed by Cloud computing, given the scale of these systems and the devices' (potentially) unreliable data connections and heterogeneous computational power. A study of the state of the art has revealed that existing resource monitoring and management solutions require manual configuration and have centralized components, which we believe do not scale to larger systems. In this work, we address these limitations by presenting a novel Decentralized Management and Monitoring ("DeMMon") system, targeted at edge settings. DeMMon provides primitives to ease the development of tools that manage the computational resources supporting edge-enabled applications, decomposed into components, through decentralized actions, taking advantage of partial knowledge of the system. Our solution was evaluated to assess its benefits in information dissemination and monitoring across a set of realistic emulated scenarios of up to 750 nodes with variable failure rates. The results show the validity of our approach and that it can outperform state-of-the-art solutions in scalability and reliability.

    The centralized computing model used in the Cloud computing paradigm has limitations in the context of Internet of Things and mobile applications. In these applications, data is mostly produced and consumed by devices at the edge of the network. Transporting this data to and from the data centres therefore places an excessive load on the network infrastructure connecting the devices to the data centres, increasing response latency and reducing quality of service for users. To address these limitations, the Edge Computing paradigm emerged: it proposes executing computations, and potentially storing data, on devices outside the data centres, closer to the clients, reducing costs and opening up a new range of possibilities for performing distributed computations closer to the devices that produce and consume the data. However, managing and monitoring the execution of these devices raises obstacles not considered by Cloud computing, such as the scale of these systems, or the variability in the connectivity and computational capacity of the devices that compose them. A study of the literature reveals that popular tools for managing and monitoring applications and devices have scalability limitations, such as centralized points of failure, or require manual configuration of each device. This dissertation proposes a new decentralized solution for monitoring and information dissemination. This solution offers operations for collecting information about the state of the system, to be used by (also decentralized) solutions that manage applications specialized for execution at the edge of the network. Our solution was evaluated on emulated networks of various sizes with up to 750 nodes, in the context of information dissemination and monitoring. Our results show that our system can be more robust while also being more scalable than the state of the art.
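
    As a rough illustration of the decentralized, partial-knowledge style of dissemination described above, and not DeMMon's actual protocol, the sketch below simulates round-based push gossip across 750 emulated nodes; the fanout and failure rate are illustrative placeholders.

    # Sketch: push-gossip dissemination with partial views (illustrative parameters).
    import random

    N = 750           # emulated edge nodes, matching the evaluation scale above
    FANOUT = 4        # peers contacted per round (partial knowledge of the system)
    FAIL_RATE = 0.05  # fraction of nodes that silently drop messages

    random.seed(42)
    alive = [random.random() > FAIL_RATE for _ in range(N)]
    alive[0] = True   # the node that observes the metric update is alive
    informed = {0}    # node 0 starts gossiping the update

    rounds = 0
    while len(informed) < sum(alive) and rounds < 50:
        reached = set()
        for node in informed:
            # Each informed node pushes the update to a few randomly chosen peers.
            for peer in random.sample(range(N), FANOUT):
                if alive[peer]:
                    reached.add(peer)
        informed |= reached
        rounds += 1

    print(f"update reached {len(informed)}/{sum(alive)} live nodes in {rounds} rounds")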

    Optical Technologies and Control Methods for Scalable Data Centre Networks

    Owing to the increasing adoption of cloud services, video services and associated machine learning applications, the traffic demand inside data centres is increasing exponentially, which necessitates an innovative networking infrastructure with high scalability and cost-efficiency. As a promising candidate for providing high-capacity, low-latency, cost-effective and scalable interconnections, optical technologies have been introduced into data centre networks (DCNs) for approximately a decade. To further improve DCN performance and meet the increasing traffic demand using photonic technologies, two current trends are a) increasing the bandwidth density of the transmission links and b) maximizing IT and network resource utilization through disaggregated topologies and architectures. This PhD thesis therefore focuses on introducing and applying advanced, efficient technologies in these two fields to DCNs to improve their performance. On the one hand, at the link level, since traditional single-mode fibre (SMF) solutions based on wavelength division multiplexing (WDM) over the C+L band may fall short of satisfying the capacity, front-panel density, power consumption and cost requirements of high-performance DCNs, a space division multiplexing (SDM) based DCN using homogeneous multi-core fibres (MCFs) is proposed. With the exploited bi-directional model and the proposed spectrum allocation algorithms, the proposed DCN shows great benefits over the SMF solution in terms of network capacity and spatial efficiency. Meanwhile, it is found that the inter-core crosstalk (IC-XT) between adjacent cores inside the MCF is dynamic rather than static; therefore, the behaviour of the IC-XT is experimentally investigated under different transmission conditions. On the other hand, an optically disaggregated DCN is developed; to ensure its performance, different architectures, topologies, and resource routing and allocation algorithms are proposed and compared. Compared to the traditional server-based DCN, resource utilization, scalability and cost-efficiency are significantly improved.
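
    As a hedged sketch of the basic operation behind such spectrum allocation schemes, and not the algorithms proposed in the thesis, the snippet below assigns contiguous spectrum slots on the cores of a single multi-core fibre link using a simple first-fit policy; the core and slot counts are illustrative placeholders.

    # Sketch: first-fit spectrum-slot allocation on one MCF link (illustrative sizes).
    NUM_CORES = 7       # homogeneous multi-core fibre (illustrative)
    NUM_SLOTS = 320     # spectrum slots per core, e.g. a 12.5 GHz grid (illustrative)

    # occupancy[core][slot] is True when that slot already carries traffic
    occupancy = [[False] * NUM_SLOTS for _ in range(NUM_CORES)]

    def first_fit(demand_slots):
        """Return (core, first_slot) for a contiguous free block, or None if blocked."""
        for core in range(NUM_CORES):
            run = 0
            for slot in range(NUM_SLOTS):
                run = run + 1 if not occupancy[core][slot] else 0
                if run == demand_slots:
                    start = slot - demand_slots + 1
                    for s in range(start, slot + 1):
                        occupancy[core][s] = True   # mark the block as allocated
                    return core, start
        return None  # request blocked on this link

    print(first_fit(4))   # -> (0, 0)
    print(first_fit(8))   # -> (0, 4)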