756 research outputs found

    Routing on the Channel Dependency Graph:: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks

    Get PDF
    In the pursuit for ever-increasing compute power, and with Moore's law slowly coming to an end, high-performance computing started to scale-out to larger systems. Alongside the increasing system size, the interconnection network is growing to accommodate and connect tens of thousands of compute nodes. These networks have a large influence on total cost, application performance, energy consumption, and overall system efficiency of the supercomputer. Unfortunately, state-of-the-art routing algorithms, which define the packet paths through the network, do not utilize this important resource efficiently. Topology-aware routing algorithms become increasingly inapplicable, due to irregular topologies, which either are irregular by design, or most often a result of hardware failures. Exchanging faulty network components potentially requires whole system downtime further increasing the cost of the failure. This management approach becomes more and more impractical due to the scale of today's networks and the accompanying steady decrease of the mean time between failures. Alternative methods of operating and maintaining these high-performance interconnects, both in terms of hardware- and software-management, are necessary to mitigate negative effects experienced by scientific applications executed on the supercomputer. However, existing topology-agnostic routing algorithms either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables. Using the fail-in-place strategy, a well-established method for storage systems to repair only critical component failures, is a feasible solution for current and future HPC interconnects as well as other large-scale installations such as data center networks. Although, an appropriate combination of topology and routing algorithm is required to minimize the throughput degradation for the entire system. This thesis contributes a network simulation toolchain to facilitate the process of finding a suitable combination, either during system design or while it is in operation. On top of this foundation, a key contribution is a novel scheduling-aware routing, which reduces fault-induced throughput degradation while improving overall network utilization. The scheduling-aware routing performs frequent property preserving routing updates to optimize the path balancing for simultaneously running batch jobs. The increased deployment of lossless interconnection networks, in conjunction with fail-in-place modes of operation and topology-agnostic, scheduling-aware routing algorithms, necessitates new solutions to solve the routing-deadlock problem. Therefore, this thesis further advances the state-of-the-art by introducing a novel concept of routing on the channel dependency graph, which allows the design of an universally applicable destination-based routing capable of optimizing the path balancing without exceeding a given number of virtual channels, which are a common hardware limitation. This disruptive innovation enables implicit deadlock-avoidance during path calculation, instead of solving both problems separately as all previous solutions

    LAPSES: A Recipe for High-Performance Adaptive Router Design

    Get PDF
    Earlier research has shown that adaptive routing can help in improving network performance. However, it has not received adequate attention in commercial routers mainly due to the additional hardware complexity, and the perceived cost and performance degradation that may result from this complexity. These concerns can be mitigated if one can design a cost-effective router that can support adaptive routing. This paper proposes a three step recipe — Look-Ahead routing, intelligent Path Selection, and an Economic Storage implementation, called the LAPSES approach — for cost-effective high performance pipelined adaptive router design. The first step, look-ahead routing, reduces a pipeline stage in the router by making table lookup and arbitration concurrent. Next, three new traffic-sensitive path selection heuristics (LRU, LFU and MAX-CREDIT) are proposed to select one of the available alternate paths. Finally, two techniques for reducing routing table size of the adaptive router are presented. These are called meta-table routing and economical storage. The proposed economical storage needs a routing table with only 9 and 27 entries for two and three dimensional meshes, respectively. All these design ideas are evaluated on a (16 16) mesh network via simulation. A fully adaptive algorithm and various traffic patterns are used to examine the performance benefits. Performance results show that the look-ahead design as well as the path selection heuristics boost network performance, while the economical storage approach turns out to be an ideal choice in comparison to full-table and meta-table options. We believe the router resulting from these three design enhancements can make adaptive routing a viable choice for interconnects.

    High-Performance, Low-Complexity Deadlock Avoidance for Arbitrary Topologies/Routings

    Get PDF
    Recently, the use of graph-based network topologies has been proposed as an alternative to traditional networks such as tori or fat-trees due to their very good topological characteristics. However they pose practical implementation challenges such as the lack of deadlock avoidance strategies. Previous proposals are either exceedingly complex, underutilise network resources or lack flexibility. We propose- and prove formally- three generic, low-complexity dead-lock avoidance mechanisms that only require local information. The main strengths of our method are its topology- and routing- independence and that the virtual channel count is bounded by the length of the longest path. We evaluate our proposed mechanisms against previous proposals through an extensive simulation study to measure the impact on the performance using both synthetic and realistic traffic. First we compare against a well-known HPC mechanism for dragonfly and achieved similar performance level. Then we moved to Graph-based networks and show that our mechanisms can greatly outperform traditional, spanning-tree based mechanisms, even if these use a much larger number of virtual channels. Overall, we find that our proposal provides a simple, flexible and high performance deadlock-avoidance solution

    Teichien sogo ketsugomo no tame no sukeraburuna rutingu shuho

    Get PDF

    Path-Based partitioning methods for 3D Networks-on-Chip with minimal adaptive routing

    Full text link
    © 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Combining the benefits of 3D ICs and Networks-on-Chip (NoCs) schemes provides a significant performance gain in Chip Multiprocessors (CMPs) architectures. As multicast communication is commonly used in cache coherence protocols for CMPs and in various parallel applications, the performance of these systems can be significantly improved if multicast operations are supported at the hardware level. In this paper, we present several partitioning methods for the path-based multicast approach in 3D mesh-based NoCs, each with different levels of efficiency. In addition, we develop novel analytical models for unicast and multicast traffic to explore the efficiency of each approach. In order to distribute the unicast and multicast traffic more efficiently over the network, we propose the Minimal and Adaptive Routing (MAR) algorithm for the presented partitioning methods. The analytical and experimental results show that an advantageous method named Recursive Partitioning (RP) outperforms the other approaches. RP recursively partitions the network until all partitions contain a comparable number of switches and thus the multicast traffic is equally distributed among several subsets and the network latency is considerably decreased. The simulation results reveal that the RP method can achieve performance improvement across all workloads while performance can be further improved by utilizing the MAR algorithm. Nineteen percent average and 42 percent maximum latency reduction are obtained on SPLASH-2 and PARSEC benchmarks running on a 64-core CMP.Ebrahimi, M.; Daneshtalab, M.; Liljeberg, P.; Plosila, J.; Flich Cardo, J.; Tenhunen, H. (2014). Path-Based partitioning methods for 3D Networks-on-Chip with minimal adaptive routing. IEEE Transactions on Computers. 63(3):718-733. doi:10.1109/TC.2012.255S71873363

    Adaptive Routing Approaches for Networked Many-Core Systems

    Get PDF
    Through advances in technology, System-on-Chip design is moving towards integrating tens to hundreds of intellectual property blocks into a single chip. In such a many-core system, on-chip communication becomes a performance bottleneck for high performance designs. Network-on-Chip (NoC) has emerged as a viable solution for the communication challenges in highly complex chips. The NoC architecture paradigm, based on a modular packet-switched mechanism, can address many of the on-chip communication challenges such as wiring complexity, communication latency, and bandwidth. Furthermore, the combined benefits of 3D IC and NoC schemes provide the possibility of designing a high performance system in a limited chip area. The major advantages of 3D NoCs are the considerable reductions in average latency and power consumption. There are several factors degrading the performance of NoCs. In this thesis, we investigate three main performance-limiting factors: network congestion, faults, and the lack of efficient multicast support. We address these issues by the means of routing algorithms. Congestion of data packets may lead to increased network latency and power consumption. Thus, we propose three different approaches for alleviating such congestion in the network. The first approach is based on measuring the congestion information in different regions of the network, distributing the information over the network, and utilizing this information when making a routing decision. The second approach employs a learning method to dynamically find the less congested routes according to the underlying traffic. The third approach is based on a fuzzy-logic technique to perform better routing decisions when traffic information of different routes is available. Faults affect performance significantly, as then packets should take longer paths in order to be routed around the faults, which in turn increases congestion around the faulty regions. We propose four methods to tolerate faults at the link and switch level by using only the shortest paths as long as such path exists. The unique characteristic among these methods is the toleration of faults while also maintaining the performance of NoCs. To the best of our knowledge, these algorithms are the first approaches to bypassing faults prior to reaching them while avoiding unnecessary misrouting of packets. Current implementations of multicast communication result in a significant performance loss for unicast traffic. This is due to the fact that the routing rules of multicast packets limit the adaptivity of unicast packets. We present an approach in which both unicast and multicast packets can be efficiently routed within the network. While suggesting a more efficient multicast support, the proposed approach does not affect the performance of unicast routing at all. In addition, in order to reduce the overall path length of multicast packets, we present several partitioning methods along with their analytical models for latency measurement. This approach is discussed in the context of 3D mesh networks.Siirretty Doriast

    Fail-in-Place Network Design: Interaction Between Topology, Routing Algorithm and Failures

    Full text link
    Abstract—The growing system size of high performance com-puters results in a steady decrease of the mean time between failures. Exchanging network components often requires whole system downtime which increases the cost of failures. In this work, we study a fail-in-place strategy where broken network elements remain untouched. We show, that a fail-in-place strategy is feasible for todays networks and the degradation is manageable, and provide guidelines for the design. Our network failure simulation toolchain allows system designers to extrapolate the performance degradation based on expected failure rates, and it can be used to evaluate the current state of a system. In a case study of real-world HPC systems, we will analyze the performance degradation throughout the systems lifetime under the assumption that faulty network components are not repaired, which results in a recommendation to change the used routing algorithm to improve the network performance as well as the fail-in-place characteristic. Keywords—Network design, network simulations, network man-agement, fail-in-place, routing protocols, fault tolerance, availability I

    NoCo: ILP-based worst-case contention estimation for mesh real-time manycores

    Get PDF
    Manycores are capable of providing the computational demands required by functionally-advanced critical applications in domains such as automotive and avionics. In manycores a network-on-chip (NoC) provides access to shared caches and memories and hence concentrates most of the contention that tasks suffer, with effects on the worst-case contention delay (WCD) of packets and tasks' WCET. While several proposals minimize the impact of individual NoC parameters on WCD, e.g. mapping and routing, there are strong dependences among these NoC parameters. Hence, finding the optimal NoC configurations requires optimizing all parameters simultaneously, which represents a multidimensional optimization problem. In this paper we propose NoCo, a novel approach that combines ILP and stochastic optimization to find NoC configurations in terms of packet routing, application mapping, and arbitration weight allocation. Our results show that NoCo improves other techniques that optimize a subset of NoC parameters.This work has been partially supported by the Spanish Ministry of Economy and Competitiveness under grant TIN2015- 65316-P and the HiPEAC Network of Excellence. It also received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (agreement No. 772773). Carles Hernández is jointly supported by the MINECO and FEDER funds through grant TIN2014-60404-JIN. Jaume Abella has been partially supported by the Spanish Ministry of Economy and Competitiveness under Ramon y Cajal postdoctoral fellowship number RYC-2013-14717. Enrico Mezzetti has been partially supported by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva-Incorporaci´on postdoctoral fellowship number IJCI-2016-27396.Peer ReviewedPostprint (author's final draft

    CLEX: Yet Another Supercomputer Architecture?

    Get PDF
    We propose the CLEX supercomputer topology and routing scheme. We prove that CLEX can utilize a constant fraction of the total bandwidth for point-to-point communication, at delays proportional to the sum of the number of intermediate hops and the maximum physical distance between any two nodes. Moreover, % applying an asymmetric bandwidth assignment to the links, all-to-all communication can be realized (1+o(1))(1+o(1))-optimally both with regard to bandwidth and delays. This is achieved at node degrees of nεn^{\varepsilon}, for an arbitrary small constant ε(0,1]\varepsilon\in (0,1]. In contrast, these results are impossible in any network featuring constant or polylogarithmic node degrees. Through simulation, we assess the benefits of an implementation of the proposed communication strategy. Our results indicate that, for a million processors, CLEX can increase bandwidth utilization and reduce average routing path length by at least factors 1010 respectively 55 in comparison to a torus network. Furthermore, the CLEX communication scheme features several other properties, such as deadlock-freedom, inherent fault-tolerance, and canonical partition into smaller subsystems
    corecore