53 research outputs found

    OutFlank Routing: Increasing Throughput in Toroidal Interconnection Networks

    Full text link
    We present a new, deadlock-free, routing scheme for toroidal interconnection networks, called OutFlank Routing (OFR). OFR is an adaptive strategy which exploits non-minimal links, both in the source and in the destination nodes. When minimal links are congested, OFR deroutes packets to carefully chosen intermediate destinations, in order to obtain travel paths which are only an additive constant longer than the shortest ones. Since routing performance is very sensitive to changes in the traffic model or in the router parameters, an accurate discrete-event simulator of the toroidal network has been developed to empirically validate OFR, by comparing it against other relevant routing strategies, over a range of typical real-world traffic patterns. On the 16x16x16 (4096 nodes) simulated network OFR exhibits improvements of the maximum sustained throughput between 14% and 114%, with respect to Adaptive Bubble Routing.Comment: 9 pages, 5 figures, to be presented at ICPADS 201

    Routing on the Channel Dependency Graph:: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks

    Get PDF
    In the pursuit for ever-increasing compute power, and with Moore's law slowly coming to an end, high-performance computing started to scale-out to larger systems. Alongside the increasing system size, the interconnection network is growing to accommodate and connect tens of thousands of compute nodes. These networks have a large influence on total cost, application performance, energy consumption, and overall system efficiency of the supercomputer. Unfortunately, state-of-the-art routing algorithms, which define the packet paths through the network, do not utilize this important resource efficiently. Topology-aware routing algorithms become increasingly inapplicable, due to irregular topologies, which either are irregular by design, or most often a result of hardware failures. Exchanging faulty network components potentially requires whole system downtime further increasing the cost of the failure. This management approach becomes more and more impractical due to the scale of today's networks and the accompanying steady decrease of the mean time between failures. Alternative methods of operating and maintaining these high-performance interconnects, both in terms of hardware- and software-management, are necessary to mitigate negative effects experienced by scientific applications executed on the supercomputer. However, existing topology-agnostic routing algorithms either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables. Using the fail-in-place strategy, a well-established method for storage systems to repair only critical component failures, is a feasible solution for current and future HPC interconnects as well as other large-scale installations such as data center networks. Although, an appropriate combination of topology and routing algorithm is required to minimize the throughput degradation for the entire system. This thesis contributes a network simulation toolchain to facilitate the process of finding a suitable combination, either during system design or while it is in operation. On top of this foundation, a key contribution is a novel scheduling-aware routing, which reduces fault-induced throughput degradation while improving overall network utilization. The scheduling-aware routing performs frequent property preserving routing updates to optimize the path balancing for simultaneously running batch jobs. The increased deployment of lossless interconnection networks, in conjunction with fail-in-place modes of operation and topology-agnostic, scheduling-aware routing algorithms, necessitates new solutions to solve the routing-deadlock problem. Therefore, this thesis further advances the state-of-the-art by introducing a novel concept of routing on the channel dependency graph, which allows the design of an universally applicable destination-based routing capable of optimizing the path balancing without exceeding a given number of virtual channels, which are a common hardware limitation. This disruptive innovation enables implicit deadlock-avoidance during path calculation, instead of solving both problems separately as all previous solutions

    Performance evaluation of different routing algorithms in network on chip

    Get PDF
    Network on Chip (NoC) is one of the efficient on-chip communication architecture for System on Chip (SoC) where a large number of computational and storage blocks are integrated on a single chip. NoCs have tackled the disadvantages of SoCs as well as they are scalable. But an efficient routing algorithm can enhance the performance of NoC. In one chapter of the thesis three different types of routing algorithms are compared i.e. XY, OE, and DyAD. XY routing algorithm is a distributed deterministic algorithm. Odd-Even (OE) routing algorithm is distributed adaptive routing algorithm with deadlock-free ability. DyAD is a smart routing algorithm which combines the features of both deterministic and adaptive routing. In another chapter of thesis three different types of deadlock free routing algorithms are compared i.e. one deterministic routing (XY routing algorithm), three partially adaptive routing (West first, North last and Negative first) and two adaptive routing (DyXY, OE) are being compared with % of load for various traffic patterns. In another chapter of thesis, a fault tolerant algorithm is described and its performance is compared with all the deadlock free routing algorithms in a NoC having link faults and node faults. All these simulation is done in NIRGAM 2.1 simulator which is a cycle accurate systemC based simulator

    Reducing the spectral radius of a torus network by link removal

    Full text link
    The optimal link removal (OLR) problem aims at removing a given number of links of a network so that the spectral radius of the residue network obtained by removing the links from the network attains the minimum. Torus networks are a class of regular networks that have witnessed widespread applications. This paper addresses three subproblems of the OLR problem for torus networks, where two or three or four edges are removed. For either of the three subproblems, a link-removing scheme is described. Exhaustive searches show that, for small-sized tori, each of the proposed schemes produces an optimal solution to the corresponding subproblem. Monte-Carlo simulations demonstrate that, for medium-sized tori, each of the three schemes produces a solution to the corresponding subproblem, which is optimal when compared to a large set of randomly produced link-removing schemes. Consequently, it is speculated that each of the three schemes produces an optimal solution to the corresponding subproblem for all torus networks. The set of links produced by each of our schemes is evenly distributed over a network, which may be a common feature of an optimal solution to the OLR problem for regular networks

    High performance communication on reconfigurable clusters

    Get PDF
    High Performance Computing (HPC) has matured to where it is an essential third pillar, along with theory and experiment, in most domains of science and engineering. Communication latency is a key factor that is limiting the performance of HPC, but can be addressed by integrating communication into accelerators. This integration allows accelerators to communicate with each other without CPU interactions, and even bypassing the network stack. Field Programmable Gate Arrays (FPGAs) are the accelerators that currently best integrate communication with computation. The large number of Multi-gigabit Transceivers (MGTs) on most high-end FPGAs can provide high-bandwidth and low-latency inter-FPGA connections. Additionally, the reconfigurable FPGA fabric enables tight coupling between computation kernel and network interface. Our thesis is that an application-aware communication infrastructure for a multi-FPGA system makes substantial progress in solving the HPC communication bottleneck. This dissertation aims to provide an application-aware solution for communication infrastructure for FPGA-centric clusters. Specifically, our solution demonstrates application-awareness across multiple levels in the network stack, including low-level link protocols, router microarchitectures, routing algorithms, and applications. We start by investigating the low-level link protocol and the impact of its latency variance on performance. Our results demonstrate that, although some link jitter is always present, we can still assume near-synchronous communication on an FPGA-cluster. This provides the necessary condition for statically-scheduled routing. We then propose two novel router microarchitectures for two different kinds of workloads: a wormhole Virtual Channel (VC)-based router for workloads with dynamic communication, and a statically-scheduled Virtual Output Queueing (VOQ)-based router for workloads with static communication. For the first (VC-based) router, we propose a framework that generates application-aware router configurations. Our results show that, by adding application-awareness into router configuration, the network performance of FPGA clusters can be substantially improved. For the second (VOQ-based) router, we propose a novel offline collective routing algorithm. This shows a significant advantage over a state-of-the-art collective routing algorithm. We apply our communication infrastructure to a critical strong-scaling HPC kernel, the 3D FFT. The experimental results demonstrate that the performance of our design is faster than that on CPUs and GPUs by at least one order of magnitude (achieving strong scaling for the target applications). Surprisingly, the FPGA cluster performance is similar to that of an ASIC-cluster. We also implement the 3D FFT on another multi-FPGA platform: the Microsoft Catapult II cloud. Its performance is also comparable or superior to CPU and GPU HPC clusters. The second application we investigate is Molecular Dynamics Simulation (MD). We model MD on both FPGA clouds and clusters. We find that combining processing and general communication in the same device leads to extremely promising performance and the prospect of MD simulations well into the us/day range with a commodity cloud

    Application-aware deadlock-free oblivious routing

    Get PDF
    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009.Cataloged from PDF version of thesis.Includes bibliographical references (p. 67-71).Systems that can be integrated on a single silicon die have become larger and increasingly complex, and wire designs as communication mechanisms for these systems on chip (SoC) have shown to be a limiting factor in their performance. As an approach to remove the limitation of communication and to overcome wire delays, interconnection networks or Network-on-Chip (NoC) architectures have emerged. NoC architectures enable faster data communication between components and are more scalable. In designing NoC systems, there are three key issues; the topology, which directly depends on packaging technology and manufacturing costs, dictates the throughput and latency bounds of the network; the flit control protocol, which establishes how the network resources are allocated to packets exchanged between components; and finally, the routing algorithm, which aims at optimizing network performance for some topology and flow control protocol by selecting appropriate paths for those packets. Since the routing algorithm sits on top of the other layers of design, it is critical that routing is done in a matter that makes good usage of the resources of the network. Two main approaches to routing, oblivious and adaptive, have been followed in creating routing algorithms for these systems. Each approach has its pros and cons; oblivious routing, as opposite to adaptive routing, uses no network state information in determining routes at the cost of lower performance on certain applications, but has been widely used because of its simpler hardware requirements.(cont.) This thesis examines oblivious routing schemes for NoC architectures. It introduces various non-minimal, oblivious routing algorithms that globally allocate network bandwidth for a given application when estimated bandwidths for data transfers are provided, while ensuring deadlock freedom with no significant additional hardware. The work presents and evaluates these oblivious routing algorithms which attempt to minimize the maximum channel load (MCL) across all network links in an effort to maximize application throughput. Simulation results from popular synthetic benchmarks and concrete applications, such as an H.264 decoder, show that it is possible to achieve better performance than traditional deterministic and oblivious routing schemes.by Michel A. Kinsy.S.M

    Scheduling in Transactional Memory Systems: Models, Algorithms, and Evaluations

    Get PDF
    Transactional memory provides an alternative synchronization mechanism that removes many limitations of traditional lock-based synchronization so that concurrent program writing is easier than lock-based code in modern multicore architectures. The fundamental module in a transactional memory system is the transaction which represents a sequence of read and write operations that are performed atomically to a set of shared resources; transactions may conflict if they access the same shared resources. A transaction scheduling algorithm is used to handle these transaction conflicts and schedule appropriately the transactions. In this dissertation, we study transaction scheduling problem in several systems that differ through the variation of the intra-core communication cost in accessing shared resources. Symmetric communication costs imply tightly-coupled systems, asymmetric communication costs imply large-scale distributed systems, and partially asymmetric communication costs imply non-uniform memory access systems. We made several theoretical contributions providing tight, near-tight, and/or impossibility results on three different performance evaluation metrics: execution time, communication cost, and load, for any transaction scheduling algorithm. We then complement these theoretical results by experimental evaluations, whenever possible, showing their benefits in practical scenarios. To the best of our knowledge, the contributions of this dissertation are either the first of their kind or significant improvements over the best previously known results
    corecore