6 research outputs found

    Optimizing MPI Collective Operations for Cloud Deployments

    Cloud infrastructures are increasingly being adopted as a platform for high-performance computing (HPC) science and engineering applications. For HPC applications, the Message-Passing Interface (MPI) is widely used. Among MPI operations, collective operations are the most I/O-intensive and performance-critical. However, classical MPI implementations are inefficient on cloud infrastructures because they are implemented at the application layer using network-oblivious communication patterns. These patterns do not differentiate between local and cross-rack communication and hence do not exploit the inherent locality between processes collocated on the same node or the same rack of nodes. Consequently, they can suffer from high network overheads when communicating across racks. In this thesis, we present COOL, a simple and generic approach for MPI collective operations that enables highly efficient designs for collective operations in the cloud. We then present a system design based on COOL that describes how to implement frequently used collective operations. Our design efficiently uses the intra-rack network while significantly reducing cross-rack communication, thus improving application performance and scalability. We use software-defined networking capabilities to build more efficient network paths for I/O-intensive collective operations. Our analytical evaluation shows that our design significantly reduces the network overhead across racks. Furthermore, compared with OpenMPI and MPICH, our design reduces the latency of collective operations by a factor of log N, where N is the total number of processes, decreases the number of exchanged messages by a factor of N, and reduces the network load by up to an order of magnitude. These significant improvements come at the cost of a small increase in the computation load on a few processes.
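    The locality-aware pattern COOL targets can be pictured with a short sketch (a generic rack-aware allreduce, not COOL's actual design): processes first reduce within their rack, rack leaders combine partial results across racks, and the result is broadcast back inside each rack. The rack_id helper and the ranks-per-rack constant are illustrative assumptions; a real deployment would derive them from the topology.

```python
# Hedged sketch: a generic rack-aware allreduce built from MPI
# sub-communicators. Illustrates the locality principle, not COOL itself.
import numpy as np
from mpi4py import MPI

def rack_id(rank, ranks_per_rack=16):
    # Assumption: ranks are packed into racks contiguously.
    return rank // ranks_per_rack

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# One communicator per rack.
rack_comm = comm.Split(color=rack_id(rank), key=rank)

# Rack leaders (local rank 0) form a second, cross-rack communicator.
is_leader = rack_comm.Get_rank() == 0
leader_comm = comm.Split(color=0 if is_leader else MPI.UNDEFINED, key=rank)

data = np.full(4, float(rank))
total = np.empty_like(data)

# Phase 1: reduce inside each rack, using only intra-rack links.
rack_comm.Reduce(data, total, op=MPI.SUM, root=0)

# Phase 2: only one process per rack communicates across racks.
if is_leader:
    leader_comm.Allreduce(MPI.IN_PLACE, total, op=MPI.SUM)

# Phase 3: broadcast the final result back within each rack.
rack_comm.Bcast(total, root=0)
```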

    ACiS: smart switches with application-level acceleration

    Network performance has contributed fundamentally to the growth of supercomputing over the past decades. In parallel, High Performance Computing (HPC) peak performance has depended, first, on ever faster and denser CPUs, and then on increasing density alone. As operating frequency, and now feature size, have levelled off, two new approaches are becoming central to achieving higher net performance: configurability and integration. Configurability enables hardware to map to the application, as well as vice versa. Integration enables system components that have generally been single-function (e.g., a network that transports data) to take on additional functionality (e.g., also operating on that data). More generally, integration enables compute-everywhere: not just in CPUs and accelerators, but also in the network and, more specifically, in the communication switches. In this thesis, we propose four novel methods of enhancing HPC performance through Advanced Computing in the Switch (ACiS). More specifically, we propose various flexible and application-aware accelerators that can be embedded into or attached to existing communication switches to improve the performance and scalability of HPC and Machine Learning (ML) applications. We follow a modular design discipline, introducing composable plugins that successively add ACiS capabilities. In the first work, we propose an inline accelerator for communication switches that supports user-definable collective operations. MPI collective operations can often be performance killers in HPC applications; we address this bottleneck by offloading them to reconfigurable hardware within the switch itself. We also introduce a novel mechanism that enables the hardware to support MPI communicators of arbitrary shape and that scales to very large systems. In the second work, we propose a look-aside accelerator for communication switches that is capable of processing packets at line rate; this method handles functions requiring loops and state. The proposed in-switch accelerator is based on a RISC-V-compatible Coarse-Grained Reconfigurable Array (CGRA). To facilitate usability, we have developed a framework to compile user-provided C/C++ code into back-end instructions for configuring the accelerator. In the third work, we extend ACiS to support fused collectives and the combining of collectives with map operations. We observe that there is an opportunity to fuse communication (collectives) with computation; since the computation varies across applications, ACiS support in this method is programmable. In the fourth work, we propose that switches with ACiS support can control and manage the execution of applications, i.e., that the switch be an active device with decision-making capabilities. Switches have a central view of the network; they can collect telemetry information, monitor application behavior, and then use this information for control, decision-making, and coordination of nodes. We evaluate the feasibility of ACiS through extensive RTL-based simulation as well as deployment in an open-access cloud infrastructure. Using this simulation framework, for a Graph Convolutional Network (GCN) application as a case study, we achieve an average speedup of 3.4x across five real-world datasets on 24 nodes compared to a CPU cluster without ACiS capabilities.
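    As a concrete host-side illustration of what a "user-definable collective operation" means, the sketch below expresses a custom reduction as an MPI operator via mpi4py; the element-wise combine function is the step an inline switch accelerator such as ACiS would execute in hardware instead. The clipped-sum semantics are an invented example, not an operation from the thesis.

```python
# Hedged sketch: a user-defined MPI reduction executed in software.
# The combine step (clipped_sum) is what an in-switch accelerator offloads.
import numpy as np
from mpi4py import MPI

def clipped_sum(inbuf, inoutbuf, datatype):
    # Illustrative combine: element-wise sum clipped to [-1, 1].
    x = np.frombuffer(inbuf, dtype=np.float64)
    y = np.frombuffer(inoutbuf, dtype=np.float64)
    np.clip(x + y, -1.0, 1.0, out=y)

comm = MPI.COMM_WORLD
op = MPI.Op.Create(clipped_sum, commute=True)

send = np.full(8, 0.3)
recv = np.empty_like(send)
comm.Allreduce(send, recv, op=op)  # the combine runs once per pairwise merge

op.Free()
```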

    Configurable data center switch architectures

    In this thesis, we explore alternative architectures for implementing configurable data center switches, along with the advantages that such switches can provide. Our first contribution centers on determining switch architectures that can be implemented on Field-Programmable Gate Arrays (FPGAs) to provide configurable switching protocols. In the process, we identify a gap in the availability of frameworks for realistically evaluating the performance of switch architectures in data centers and contribute a simulation framework that relies on realistic data center traffic patterns. Our framework is then used to evaluate the performance of existing as well as newly proposed FPGA-amenable switch designs. Through collaborative work with Meng and Papaphilippou, we establish that only small- to medium-sized switches can be implemented on today's FPGAs. Our second contribution is a novel switch architecture that integrates a custom in-network hardware accelerator with a generic switch to accelerate Deep Neural Network (DNN) training applications in data centers. Our proposed accelerator architecture is prototyped on an FPGA, and a scalability study demonstrates the trade-offs of an FPGA implementation compared to an ASIC implementation. In addition to the hardware prototype, we contribute a lightweight load-balancing and congestion-control protocol that leverages the unique communication patterns of ML data-parallel jobs to enable fair sharing of network resources across jobs. Our large-scale simulations demonstrate that our novel switch architecture and lightweight congestion-control protocol both accelerate the training time of machine learning jobs by up to 1.34x and benefit other latency-sensitive applications by reducing their 99th-percentile completion time by up to 4.5x. As our final contribution, we identify the main requirements of in-network applications and propose a Network-on-Chip (NoC)-based architecture for supporting a heterogeneous set of applications. Observing the lack of tools to support such research, we provide a tool for evaluating NoC-based switch architectures.
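    The traffic pattern the proposed protocol leverages is the synchronized, repetitive gradient exchange of data-parallel training; a minimal sketch of that pattern follows. The model size, step count, and random "gradient" are placeholders, and the Allreduce in the loop is the predictable burst an in-network accelerator can aggregate inside the switch.

```python
# Hedged sketch: the periodic allreduce bursts of data-parallel DNN training.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
model = np.zeros(1 << 20, dtype=np.float32)  # placeholder: 4 MB of parameters
lr = 0.01

for step in range(100):
    # Stand-in for a real backward pass.
    grad = np.random.standard_normal(model.size).astype(np.float32)
    # Synchronized gradient aggregation: the job-wide burst that the
    # load-balancing/congestion-control protocol schedules around.
    comm.Allreduce(MPI.IN_PLACE, grad, op=MPI.SUM)
    model -= lr * grad / comm.Get_size()
```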

    Communication patterns abstractions for programming SDN to optimize high-performance computing applications

    Advisor: Luis Carlos Erpen de Bona. Co-advisors: Magnos Martinello, Marcos Didonet Del Fabro. Doctoral thesis, Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defense: Curitiba, 04/09/2017. Includes references: f. 95-113.
    Abstract: The evolution of computing and networking allowed multiple computers to be interconnected, aggregating their processing power to form high-performance computing (HPC) systems. Applications that run in these environments process huge amounts of information and can take several hours or even days to complete, motivating researchers from various computational fields to study ways of accelerating them. During processing, these applications exchange large amounts of data among the computers, causing the network to become a bottleneck. The network was long considered a static resource that allowed no dynamic adjustments to optimize its links or devices. However, Software-Defined Networking (SDN) emerged as a new paradigm that allows the network to be reprogrammed according to users' requirements. SDN has already been used to optimize the network for specific HPC applications, but no existing work takes advantage of the communication patterns those applications express. The main objective of this thesis is therefore to research how these patterns can be used to tune the network, creating new abstractions for programming it, with the aim of speeding up HPC applications. To achieve this goal, we first surveyed all SDN programmability levels. This study resulted in our first contribution, a taxonomy for grouping the high-level abstractions offered by SDN programming languages. Next, we investigated the communication patterns of HPC applications, observing their spatial and temporal behaviors by analyzing their traffic matrices (TMs). We concluded that TMs can represent the communications; furthermore, we observed that applications tend to transmit the same amounts of data among the same computational nodes. The second contribution of this thesis is a framework for avoiding the network factors that can degrade application performance, such as topology overhead, unbalanced link utilization, and issues introduced by SDN programmability. The framework provides an API and maintains a database of TMs, one for each communication pattern, annotated with bandwidth and latency constraints. This information is used to reprogram network devices, evenly placing the communications on the network paths. This approach reduced the execution time of benchmarks and real applications by up to 26.5%. To avoid modifying application source code, as our third contribution we developed a method that automatically identifies communication patterns. This method generates a distinct visual texture for each TM and, through machine learning (ML) techniques, identifies the applications that are using the network. In our experiments, the method achieved an accuracy above 98%. Finally, we incorporated this method into the framework, creating an abstraction that allows programming the network without changing the HPC applications and reducing their execution times by 15.8% on average. Keywords: Software-Defined Networking, Communication Patterns, HPC Applications.
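    A minimal sketch of the identification idea follows: represent each application's communication as a normalized traffic matrix and match an observed TM against stored reference patterns. The thesis renders TMs as visual textures and classifies them with ML; the nearest-neighbor match and the ring/all-to-all reference patterns below are simplified stand-ins.

```python
# Hedged sketch: matching an observed traffic matrix (TM) to known patterns.
import numpy as np

def normalize(tm):
    # Scale byte counts so TMs of different traffic volumes are comparable.
    total = tm.sum()
    return tm / total if total else tm

def identify(observed, references):
    # references: {app_name: TM}. Nearest neighbor by Frobenius distance.
    obs = normalize(observed)
    return min(references,
               key=lambda app: np.linalg.norm(obs - normalize(references[app])))

n = 8  # number of ranks (illustrative)
ring = np.roll(np.eye(n), 1, axis=1) * 1e6       # each rank sends to its neighbor
alltoall = (np.ones((n, n)) - np.eye(n)) * 1e5   # uniform pairwise exchange
refs = {"ring": ring, "alltoall": alltoall}

observed = ring + np.random.rand(n, n) * 1e4     # noisy measurement of a ring app
print(identify(observed, refs))                  # -> "ring"
```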

    Performance evaluation of SDN-enhanced MPI allreduce on a cluster system with fat-tree interconnect


    Hyperscale Data Processing With Network-Centric Designs

    Today's largest data processing workloads are hosted in cloud data centers. Due to unprecedented data growth and the end of Moore's Law, these workloads have ballooned to the hyperscale level, encompassing billions to trillions of data items and hundreds to thousands of machines per query. These workloads are enabled by, and grow with, highly scalable data center networks that connect up to hundreds of thousands of networked servers. These massive scales fundamentally challenge the designs of both data processing systems and data center networks, and the classic layered designs are no longer sustainable. Rather than optimizing these massive layers in silos, we build systems across them with principled network-centric designs. In current networks, we redesign data processing systems with network awareness to minimize the cost of moving data in the network. In future networks, we propose new interfaces and services that the cloud infrastructure offers to applications, and we co-design data processing systems to achieve optimal query processing performance. To transform the network toward these future designs, we facilitate network innovation at scale. This dissertation presents a line of systems work covering all three directions. It first discusses GraphRex, a network-aware system that combines classic database and systems techniques to push the performance of massive graph queries in current data centers. It then introduces data processing in disaggregated data centers, a promising new cloud proposal. It details TELEPORT, a compute-pushdown feature that eliminates data processing performance bottlenecks in disaggregated data centers, and Redy, which provides high-performance caches using remote disaggregated memory. Finally, it presents MimicNet, a fine-grained simulation framework that evaluates network proposals at data center scale with machine learning approximation. These systems demonstrate that our network-centric designs achieve orders of magnitude higher efficiency compared to the state of the art at hyperscale.