29 research outputs found

    Routing on the Channel Dependency Graph:: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks

    Get PDF
    In the pursuit for ever-increasing compute power, and with Moore's law slowly coming to an end, high-performance computing started to scale-out to larger systems. Alongside the increasing system size, the interconnection network is growing to accommodate and connect tens of thousands of compute nodes. These networks have a large influence on total cost, application performance, energy consumption, and overall system efficiency of the supercomputer. Unfortunately, state-of-the-art routing algorithms, which define the packet paths through the network, do not utilize this important resource efficiently. Topology-aware routing algorithms become increasingly inapplicable, due to irregular topologies, which either are irregular by design, or most often a result of hardware failures. Exchanging faulty network components potentially requires whole system downtime further increasing the cost of the failure. This management approach becomes more and more impractical due to the scale of today's networks and the accompanying steady decrease of the mean time between failures. Alternative methods of operating and maintaining these high-performance interconnects, both in terms of hardware- and software-management, are necessary to mitigate negative effects experienced by scientific applications executed on the supercomputer. However, existing topology-agnostic routing algorithms either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables. Using the fail-in-place strategy, a well-established method for storage systems to repair only critical component failures, is a feasible solution for current and future HPC interconnects as well as other large-scale installations such as data center networks. Although, an appropriate combination of topology and routing algorithm is required to minimize the throughput degradation for the entire system. This thesis contributes a network simulation toolchain to facilitate the process of finding a suitable combination, either during system design or while it is in operation. On top of this foundation, a key contribution is a novel scheduling-aware routing, which reduces fault-induced throughput degradation while improving overall network utilization. The scheduling-aware routing performs frequent property preserving routing updates to optimize the path balancing for simultaneously running batch jobs. The increased deployment of lossless interconnection networks, in conjunction with fail-in-place modes of operation and topology-agnostic, scheduling-aware routing algorithms, necessitates new solutions to solve the routing-deadlock problem. Therefore, this thesis further advances the state-of-the-art by introducing a novel concept of routing on the channel dependency graph, which allows the design of an universally applicable destination-based routing capable of optimizing the path balancing without exceeding a given number of virtual channels, which are a common hardware limitation. This disruptive innovation enables implicit deadlock-avoidance during path calculation, instead of solving both problems separately as all previous solutions

    FlexVC: Flexible virtual channel management in low-diameter networks

    Get PDF
    Deadlock avoidance mechanisms for lossless lowdistance networks typically increase the order of virtual channel (VC) index with each hop. This restricts the number of buffer resources depending on the routing mechanism and limits performance due to an inefficient use. Dynamic buffer organizations increase implementation complexity and only provide small gains in this context because a significant amount of buffering needs to be allocated statically to avoid congestion. We introduce FlexVC, a simple buffer management mechanism which permits a more flexible use of VCs. It combines statically partitioned buffers, opportunistic routing and a relaxed distancebased deadlock avoidance policy. FlexVC mitigates Head-of-Line blocking and reduces up to 50% the memory requirements. Simulation results in a Dragonfly network show congestion reduction and up to 37.8% throughput improvement, outperforming more complex dynamic approaches. FlexVC merges different flows of traffic in the same buffers, which in some cases makes more difficult to identify the traffic pattern in order to support nonminimal adaptive routing. An alternative denoted FlexVCminCred improves congestion sensing for adaptive routing by tracking separately packets routed minimally and nonminimally, rising throughput up to 20.4% with 25% savings in buffer area.This work has been supported by the Spanish Government (grant SEV2015-0493 of the Severo Ochoa Program), the Spanish Ministry of Economy, Industry and Competitiveness (contracts TIN2015-65316), the Spanish Research Agency (AEI/FEDER, UE - TIN2016-76635-C2-2-R), the Spanish Ministry of Education (FPU grant FPU13/00337), the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014- SGR-1272), the European Union FP7 programme (RoMoL ERC Advanced Grant GA 321253), the European HiPEAC Network of Excellence and the European Union’s Horizon 2020 research and innovation programme (Mont-Blanc project under grant agreement No 671697).Peer ReviewedPostprint (author's final draft

    A High-Performance Design, Implementation, Deployment, and Evaluation of The Slim Fly Network

    Full text link
    Novel low-diameter network topologies such as Slim Fly (SF) offer significant cost and power advantages over the established Fat Tree, Clos, or Dragonfly. To spearhead the adoption of low-diameter networks, we design, implement, deploy, and evaluate the first real-world SF installation. We focus on deployment, management, and operational aspects of our test cluster with 200 servers and carefully analyze performance. We demonstrate techniques for simple cabling and cabling validation as well as a novel high-performance routing architecture for InfiniBand-based low-diameter topologies. Our real-world benchmarks show SF's strong performance for many modern workloads such as deep neural network training, graph analytics, or linear algebra kernels. SF outperforms non-blocking Fat Trees in scalability while offering comparable or better performance and lower cost for large network sizes. Our work can facilitate deploying SF while the associated (open-source) routing architecture is fully portable and applicable to accelerate any low-diameter interconnect

    Topology Agnostic Methods for Routing, Reconfiguration and Virtualization of Interconnection Networks

    Get PDF
    Modern computing systems, such as supercomputers, data centers and multicore chips, generally require efficient communication between their different system units; tolerance towards component faults; flexibility to expand or merge; and a high utilization of their resources. Interconnection networks are used in a variety of such computing systems in order to enable communication between their diverse system units. Investigation and proposal of new or improved solutions to topology agnostic routing and reconfiguration of interconnection networks are main objectives of this thesis. In addition, topology agnostic routing and reconfiguration algorithms are utilized in the development of new and flexible approaches to processor allocation. The thesis aims to present versatile solutions that can be used for the interconnection networks of a number of different computing systems. No particular routing algorithm was specified for an interconnection network technology which is now incorporated in Dolphin Express. The thesis states a set of criteria for a suitable routing algorithm, evaluates a number of existing routing algorithms, and recommend that one of the algorithms – which fulfils all of the criteria – is used. Further investigations demonstrate how this routing algorithm inherently supports fault-tolerance, and how it can be optimized for some network topologies. These considerations are also relevant for the InfiniBand interconnection network technology. Reconfiguration of interconnection networks (change of routing function) is a deadlock prone process. Some existing reconfiguration strategies include deadlock avoidance mechanisms that significantly reduce the network service offered to running applications. The thesis expands the area of application for one of the most versatile and efficient reconfiguration algorithms available in the literature, and proposes an optimization of this algorithm that improves the network service offered to running applications. Moreover, a new reconfiguration algorithm is presented that supports a replacement of the routing function without causing performance penalties. Processor allocation strategies that guarantee traffic-containment commonly pose strict requirements on the shape of partitions, and thus achieve only a limited utilization of a system’s computing resources. The thesis introduces two new approaches that are more flexible. Both approaches utilize the properties of a topology agnostic routing algorithm in order to enforce traffic-containment within arbitrarily shaped partitions. Consequently, a high resource utilization as well as isolation of traffic between different partitions is achieved

    Energy-Efficient Interconnection Networks for High-Performance Computing

    Get PDF
    In recent years, energy has become one of the most important factors for de- signing and operating large scale computing systems. This is particularly true in high-performance computing, where systems often consist of thousands of nodes. Especially after the end of Dennard’s scaling, the demand for energy- proportionality in components, where energy is depending linearly on utilization, increases continuously. As the main contributor to the overall power consumption, processors have received the main attention so far. The increasing energy proportionality of processors, however, shifts the focus to other components such as interconnection networks. Their share of the overall power consumption is expected to increase to 20% or more while other components further increase their efficiency in the near future. Hence, it is crucial to improve energy proportionality in interconnection networks likewise to reduce overall power and energy consumption. To facilitate these attempts, this work provides comprehensive studies about energy saving in interconnection networks at different levels. First, interconnection networks differ fundamentally from other components in their underlying technology. To gain a deeper understanding of these differences and to identify targets for energy savings, this work provides a detailed power analysis of current network hardware. Furthermore, various applications at different scales are analyzed regarding their communication patterns and locality properties. The findings show that communication makes up only a small fraction of the execution time and networks are actually idling most of the time. Another observation is that point-to-point communication often only occurs within various small subsets of all participants, which indicates that a coordinated mapping could further decrease network traffic. Based on these studies, three different energy-saving policies are designed, which all differ in their implementation and focus. Then, these policies are evaluated in an event-based, power-aware network simulator. While two policies that operate completely local at link level, enable significant energy savings of more than 90% in most analyses, the hybrid one does not provide further benefits despite significant additional design effort. Additionally, these studies include network design parameters, such as transition time between different link configurations, as well as the three most common topologies in supercomputing systems. The final part of this work addresses the interactions of congestion management and energy-saving policies. Although both network management strategies aim for different goals and use opposite approaches, they complement each other and can increase energy efficiency in all studies as well as improve the performance overhead as opposed to plain energy saving

    Researching methods for efficient hardware specification, design and implementation of a next generation communication architecture

    Full text link
    The objective of this work is to create and implement a System Area Network (SAN) architecture called EXTOLL embedded in the current world of systems, software and standards based on the experiences obtained during the ATOLL project development and test. The topics of this work also cover system design methodology and educational issues in order to provide appropriate human resources and work premises. The scope of this work in the EXTOLL SAN project was: • the Xbar architecture and routing (multi-layer routing, virtual channels and their arbitration, routing formats, dead lock aviodance, debug features, automation of reuse) • the on-chip module communication architecture and parts of the host communication • the network processor architecture and integration • the development of the design methodology and the creation of the design flow • the team education and work structure. In order to successfully leverage student know-how and work flow methodology for this research project the SEED curricula changes has been governed by the Hochschul Didaktik Zentrum resulting in a certificate for "Hochschuldidaktik" and excellence in university education. The complexity of the target system required new approaches in concurrent Hardware/Software codesign. The concept of virtual hardware prototypes has been established and excessively used during design space exploration and software interface design

    Cost Effective Routing Implementations for On-chip Networks

    Full text link
    Arquitecturas de múltiples núcleos como multiprocesadores (CMP) y soluciones multiprocesador para sistemas dentro del chip (MPSoCs) actuales se basan en la eficacia de las redes dentro del chip (NoC) para la comunicación entre los diversos núcleos. Un diseño eficiente de red dentro del chip debe ser escalable y al mismo tiempo obtener valores ajustados de área, latencia y consumo de energía. Para diseños de red dentro del chip de propósito general se suele usar topologías de malla 2D ya que se ajustan a la distribución del chip. Sin embargo, la aparición de nuevos retos debe ser abordada por los diseñadores. Una mayor probabilidad de defectos de fabricación, la necesidad de un uso optimizado de los recursos para aumentar el paralelismo a nivel de aplicación o la necesidad de técnicas eficaces de ahorro de energía, puede ocasionar patrones de irregularidad en las topologías. Además, el soporte para comunicación colectiva es una característica buscada para abordar con eficacia las necesidades de comunicación de los protocolos de coherencia de caché. En estas condiciones, un encaminamiento eficiente de los mensajes se convierte en un reto a superar. El objetivo de esta tesis es establecer las bases de una nueva arquitectura para encaminamiento distribuido basado en lógica que es capaz de adaptarse a cualquier topología irregular derivada de una estructura de malla 2D, proporcionando así una cobertura total para cualquier caso resultado de soportar los retos mencionados anteriormente. Para conseguirlo, en primer lugar, se parte desde una base, para luego analizar una evolución de varios mecanismos, y finalmente llegar a una implementación, que abarca varios módulos para alcanzar el objetivo mencionado anteriormente. De hecho, esta última implementación tiene por nombre eLBDR (effective Logic-Based Distributed Routing). Este trabajo cubre desde el primer mecanismo, LBDR, hasta el resto de mecanismos que han surgido progresivamente.Rodrigo Mocholí, S. (2010). Cost Effective Routing Implementations for On-chip Networks [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8962Palanci

    Profilage et débogage par prise de traces efficaces d'applications hybrides multi-threadées HPC

    Get PDF
    Supercomputers’ evolution is at the source of both hardware and software challenges. In the quest for the highest computing power, the interdependence in-between simulation components is becoming more and more impacting, requiring new approaches. This thesis is focused on the software development aspect and particularly on the observation of parallel software when being run on several thousand cores. This observation aims at providing developers with the necessary feedback when running a program on an execution substrate which has not been modeled yet because of its complexity. In this purpose, we firstly introduce the development process from a global point of view, before describing developer tools and related work. In a second time, we present our contribution which consists in a trace based profiling and debugging tool and its evolution towards an on-line coupling method which as we will show is more scalable as it overcomes IOs limitations. Our contribution also covers our time-stamp synchronisation algorithm for tracing purposes which relies on a probabilistic approach with quantified error. We also present a tool allowing machine characterisation from the MPI aspect and demonstrate the presence of machine noise for both point to point and collectives, justifying the use of an empirical approach. In summary, this work proposes and motivates an alternative approach to trace based event collection while preserving event granularity and a reduced overheadL’évolution des supercalculateurs est à la source de défis logiciels et architecturaux. Dans la quête de puissance de calcul, l’interdépendance des éléments du processus de simulation devient de plus en plus impactante et requiert de nouvelles approches. Cette thèse se concentre sur le développement logiciel et particulièrement sur l’observation des programmes parallèles s’exécutant sur des milliers de cœurs. Dans ce but, nous décrivons d’abord le processus de développement de manière globale avant de présenter les outils existants et les travaux associés. Dans un second temps, nous détaillons notre contribution qui consiste d’une part en des outils de débogage et profilage par prise de traces, et d’autre part en leur évolution vers un couplage en ligne qui palie les limitations d’entrées–sorties. Notre contribution couvre également la synchronisation des horloges pour la prise de traces avec la présentation d’un algorithme de synchronisation probabiliste dont nous avons quantifié l’erreur. En outre, nous décrivons un outil de caractérisation machine qui couvre l’aspect MPI. Un tel outil met en évidence la présence de bruit aussi bien sur les communications de type point-à-point que de type collective. Enfin, nous proposons et motivons une alternative à la collecte d’événements par prise de traces tout en préservant la granularité des événements et un impact réduit sur les performances, tant sur le volet utilisation CPU que sur les entrées–sortie

    Computer Science and Technology Series : XV Argentine Congress of Computer Science. Selected papers

    Get PDF
    CACIC'09 was the fifteenth Congress in the CACIC series. It was organized by the School of Engineering of the National University of Jujuy. The Congress included 9 Workshops with 130 accepted papers, 1 main Conference, 4 invited tutorials, different meetings related with Computer Science Education (Professors, PhD students, Curricula) and an International School with 5 courses. CACIC 2009 was organized following the traditional Congress format, with 9 Workshops covering a diversity of dimensions of Computer Science Research. Each topic was supervised by a committee of three chairs of different Universities. The call for papers attracted a total of 267 submissions. An average of 2.7 review reports were collected for each paper, for a grand total of 720 review reports that involved about 300 different reviewers. A total of 130 full papers were accepted and 20 of them were selected for this book.Red de Universidades con Carreras en Informática (RedUNCI
    corecore