681 research outputs found

    New Fault Tolerant Multicast Routing Techniques to Enhance Distributed-Memory Systems Performance

    Distributed-memory systems are key to high-performance computing and are among the most favored architectures for advanced research problems. Mesh-connected multicomputers are one of the most popular architectures and have been implemented in many distributed-memory systems. These systems must support communication operations efficiently to achieve good performance. Wormhole switching, in which a packet is divided into small flits, has been widely used in the design of distributed-memory systems. Multicast communication, in which one source node sends the same message to several destination nodes, is also widely used in these systems. Fault tolerance refers to the ability of the system to operate correctly in the presence of faults. Developing fault-tolerant multicast routing algorithms for 2D mesh networks is an important issue. This dissertation presents new fault-tolerant multicast routing algorithms for distributed-memory systems based on wormhole-routed 2D meshes. The algorithms are described for fault-tolerant routing in 2D mesh networks, but they can also be extended to other topologies. They combine a unicast-based multicast algorithm with tree-based multicast algorithms. They work effectively for the most commonly encountered fault types in mesh networks: f-rings, f-chains, and concave fault regions. The proposed routing algorithms are shown to be effective even in the presence of many fault regions and large fault regions. They are proved to be deadlock-free, and the problem of overlapping fault regions is also solved. Four essential performance metrics in mesh networks are considered and calculated. The algorithms use limited-global-information-based multicasting, a compromise between the local-information-based and global-information-based approaches. Data mining is used to validate the results and to enlarge the sample. The proposed multicast routing techniques are used to enhance the performance of distributed-memory systems. Simulation results are presented to demonstrate the efficiency of the proposed algorithms.
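The difference between the unicast-based and tree-based multicast styles named in this abstract can be illustrated with a minimal sketch. It assumes a fault-free 2D mesh with dimension-order (X then Y) routing and counts link traversals only; it does not model flits, faults, or deadlock, and it is not the dissertation's algorithm. The names `xy_path`, `unicast_based_cost`, and `tree_based_cost` are hypothetical.

```python
from typing import List, Set, Tuple

Node = Tuple[int, int]
Link = Tuple[Node, Node]


def xy_path(src: Node, dst: Node) -> List[Link]:
    """Directed links used by dimension-order routing: X first, then Y."""
    links: List[Link] = []
    x, y = src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def unicast_based_cost(src: Node, dsts: List[Node]) -> int:
    """One separate copy per destination: every path is paid for in full."""
    return sum(len(xy_path(src, d)) for d in dsts)


def tree_based_cost(src: Node, dsts: List[Node]) -> int:
    """Copies share common path prefixes: each distinct link is used once."""
    used: Set[Link] = set()
    for d in dsts:
        used.update(xy_path(src, d))
    return len(used)


if __name__ == "__main__":
    source = (0, 0)
    destinations = [(3, 0), (3, 2), (3, 3)]
    print(unicast_based_cost(source, destinations))  # 14
    print(tree_based_cost(source, destinations))     # 6
```

In this example the three destinations share the row segment from (0, 0) to (3, 0) and part of column 3, so the tree-based variant traverses 6 links where the unicast-based variant traverses 14.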

    A Limited-Global Fault Information Model for Dynamic Routing in n-D Meshes

    A fault-tolerant routing strategy for k-ary n-direct s-indirect topologies based on intermediate nodes

    [EN] Exascale computing systems are being built with thousands of nodes. The high number of components in these systems significantly increases the probability of failure. A key component of these systems is the interconnection network. If failures occur in the interconnection network, they may isolate a large fraction of the machine. For this reason, an efficient fault-tolerant mechanism is needed to keep the system interconnected, even in the presence of faults. A recently proposed topology for these large systems is the hybrid k-ary n-direct s-indirect family, which provides optimal performance and connectivity at a reduced hardware cost. This paper presents a fault-tolerant routing methodology for the k-ary n-direct s-indirect topology that degrades performance gracefully in the presence of faults and tolerates a large number of faults without disabling any healthy computing node. To tolerate network failures, the methodology uses a simple mechanism: for any source-destination pair, if necessary, packets are forwarded to the destination node through a set of intermediate nodes (without being ejected from the network) in order to circumvent faults. The evaluation results show that the proposed methodology tolerates a large number of faults. For instance, it is able to tolerate more than 99.5% of fault combinations when there are 10 faults in a 3-D network with 1000 nodes using only 1 intermediate node, and more than 99.98% if 2 intermediate nodes are used. Furthermore, the methodology offers graceful performance degradation. As an example, performance degrades by only 1% for a 2-D network with 1024 nodes and 1% faulty links.
This work was supported by the Spanish Ministerio de Economía y Competitividad (MINECO), by FEDER funds under Grant TIN2015-66972-C5-1-R, by Programa de Ayudas de Investigación y Desarrollo (PAID) from Universitat Politècnica de València, and by the financial support of the FP7 HiPEAC Network of Excellence under grant agreement 287759.
Peñaranda Cebrián, R.; Gómez Requena, ME.; López Rodríguez, PJ.; Gran, EG.; Skeie, T. (2017). A fault-tolerant routing strategy for k-ary n-direct s-indirect topologies based on intermediate nodes. Concurrency and Computation: Practice and Experience. 29(13):1-11. https://doi.org/10.1002/cpe.4065
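The intermediate-node idea in this abstract can be sketched as follows. The sketch makes several assumptions that are not from the paper: it uses a plain 2D mesh with dimension-order (X then Y) routing instead of the k-ary n-direct s-indirect topology, it tries at most one intermediate node, and it ignores deadlock avoidance. The names `xy_links`, `path_ok`, and `route_via_intermediate` are hypothetical.

```python
from itertools import product
from typing import FrozenSet, List, Optional, Set, Tuple

Node = Tuple[int, int]
Link = FrozenSet[Node]  # undirected link between two adjacent nodes


def xy_links(src: Node, dst: Node) -> List[Link]:
    """Links crossed by dimension-order routing: X first, then Y."""
    links: List[Link] = []
    x, y = src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(frozenset({(x, y), (nx, y)}))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(frozenset({(x, y), (x, ny)}))
        y = ny
    return links


def path_ok(src: Node, dst: Node, faulty: Set[Link]) -> bool:
    """True if the dimension-order path crosses no faulty link."""
    return all(link not in faulty for link in xy_links(src, dst))


def route_via_intermediate(src: Node, dst: Node, faulty: Set[Link],
                           width: int, height: int) -> Optional[List[Node]]:
    """Return waypoints [src, dst] or [src, mid, dst]; None if neither works."""
    if path_ok(src, dst, faulty):
        return [src, dst]
    # Try every other node as a single intermediate waypoint.
    for mid in product(range(width), range(height)):
        if mid in (src, dst):
            continue
        if path_ok(src, mid, faulty) and path_ok(mid, dst, faulty):
            return [src, mid, dst]
    return None


if __name__ == "__main__":
    broken = {frozenset({(1, 0), (2, 0)})}
    print(route_via_intermediate((0, 0), (3, 0), broken, 4, 4))
    # [(0, 0), (0, 1), (3, 0)]
```

The search returns the first workable intermediate node in scan order and does not try to minimize the detour; a fuller treatment would also consider path length and two or more intermediate nodes, as the abstract's 99.98% figure suggests.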

    On Constructing the Minimum Orthogonal Convex Polygon in 2-D Faulty Meshes

    Transient and Permanent Error Control for High-End Multiprocessor Systems-on-Chip

    High-end MPSoC systems with built-in high-radix topologies achieve good performance because of their improved connectivity and reduced network diameter. In high-end MPSoC systems, fault tolerance support is becoming a compulsory feature. In this work, we propose a combined method to address permanent and transient link and router failures in those systems. The LBDRhr mechanism is proposed to tolerate permanent link failures in some popular high-radix topologies. The increased router complexity may lead to more transient router errors than in routers using a simple XY routing algorithm. We exploit the inherent information redundancy (IIR) in the LBDRhr logic to manage transient errors in the network routers. Thorough analyses are provided to identify the appropriate internal nodes and the forbidden signal patterns for transient error detection. Simulation results show that the LBDRhr logic can tolerate all permanent failure combinations of long-range links and 80% of link failures on short-range links. Case studies show that the error detection method based on the new IIR extraction method reduces the power consumption by 33% and the residual error rate by up to two orders of magnitude, compared to triple modular redundancy. The impact of network topologies on the efficiency of the detection mechanism is also examined in this work.

    Resilient Routing Implementation in 2D Mesh NoC

    With the rapid shrinking of technology and growing integration capacity, the probability of failures in Networks-on-Chip (NoCs) increases, and thus fault tolerance is essential. Moreover, the unpredictable locations of these failures may disturb the regularity of the underlying topology, so a regular 2D mesh is likely to become irregular. For these failure-prone networks, a viable routing framework should therefore comprise a topology-agnostic routing algorithm along with a cost-effective, scalable routing mechanism able to handle failures irrespective of any particular failure pattern. Existing routing techniques designed to route irregular topologies efficiently lack flexibility (logic-based), scalability (table-based), or relaxed switch design (uLBDR-based). Designing an efficient routing implementation technique for irregular topologies remains a pressing research problem. To address this, we present a fault-resilient routing mechanism for irregular 2D meshes resulting from failures. To handle irregularities, it avoids routing tables and employs a few fixed configuration bits per switch, resulting in a scalable approach. Experiments demonstrate that the proposed approach is guaranteed to tolerate all locations of single- and double-link failures and most multiple failures. Also, unlike uLBDR, it is not restricted to any particular switching technique and does not replicate any extra messages. Along with fault tolerance, the proposed mechanism can achieve better network performance in fault-free cases, and it achieves graceful performance degradation during failure. Compared to uLBDR, our method has 14% lower area requirements and 16% lower overall power consumption.

    On Achieving the Shortest-Path Routing in 2-D Meshes
