45 research outputs found

    Tolerating multiple faults in multistage interconnection networks with minimal extra stages

    Get PDF
    Adams and Siegel (1982) proposed an extra stage cube interconnection network that tolerates one switch failure with one extra stage. We extend their results and discover a class of extra stage interconnection networks that tolerate multiple switch failures with a minimal number of extra stages. Adopting the same fault model as Adams and Siegel, the faulty switches can be bypassed by a pair of demultiplexer/multiplexer combinations. It is easy to show that, to maintain point to point and broadcast connectivities, there must be at least S extra stages to tolerate I switch failures. We present the first known construction of an extra stage interconnection network that meets this lower-bound. This 12-dimensional multistage interconnection network has n+f stages and tolerates I switch failures. An n-bit label called mask is used for each stage that indicates the bit differences between the two inputs coming into a common switch. We designed the fault-tolerant construction such that it repeatedly uses the singleton basis of the n-dimensional vector space as the stage mask vectors. This construction is further generalized and we prove that an n-dimensional multistage interconnection network is optimally fault-tolerant if and only if the mask vectors of every n consecutive stages span the n-dimensional vector space

    Tolerating multiple faults in multistage interconnection networks with minimal extra stages

    Full text link

    A Family of Fault-Tolerant Efficient Indirect Topologies

    Full text link
    © 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.On the one hand, performance and fault-tolerance of interconnection networks are key design issues for high performance computing (HPC) systems. On the other hand, cost should be also considered. Indirect topologies are often chosen in the design of HPC systems. Among them, the most commonly used topology is the fat-tree. In this work, we focus on getting the maximum benefits from the network resources by designing a simple indirect topology with very good performance and fault-tolerance properties, while keeping the hardware cost as low as possible. To do that, we propose some extensions to the fat-tree topology to take full advantage of the hardware resources consumed by the topology. In particular, we propose three new topologies with different properties in terms of cost, performance and fault-tolerance. All of them are able to achieve a similar or better performance results than the fat-tree, providing also a good level of fault-tolerance and, contrary to most of the available topologies, these proposals are able to tolerate also faults in the links that connect to end nodes.This work was supported by the Spanish Ministerio de Economia y Competitividad (MINECO) and by FEDER funds under Grant TIN2012-38341-C04-01.Bermúdez Garzón, DF.; Gómez Requena, C.; Gómez Requena, ME.; López Rodríguez, PJ.; Duato Marín, JF. (2016). A Family of Fault-Tolerant Efficient Indirect Topologies. IEEE Transactions on Parallel and Distributed Systems. 27(4):927-940. https://doi.org/10.1109/TPDS.2015.2430863S92794027

    Modeling and Analysis of Fault Tolerant Multistage Interconnection Networks

    Get PDF
    Performance and reliability are two of the most crucial issues in today\u27s high-performance instrumentation and measurement systems. High speed and compact density multistage interconnection networks (MINs) are widely-used subsystems in different applications. New performance models are proposed to evaluate a novel fault tolerant MIN arrangement, thereby assuring performance and reliability with high confidence level. A concurrent fault detection and recovery scheme for MINs is considered by rerouting over redundant interconnection links under stringent real-time constraints for digital instrumentation as sensor networks. A switch architecture for concurrent testing and diagnosis is proposed. New performance models are developed and used to evaluate the compound effect of fault tolerant operation (inclusive of testing, diagnosis, and recovery) on the overall throughput and delay. Results are shown for single transient and permanent stuck-at faults on links and storage units in the switching elements. It is shown that performance degradation due to fault tolerance is graceful while performance degradation without fault recovery is unacceptable

    A fault-tolerant routing strategy for k-ary n-direct s-indirect topologies based on intermediate nodes

    Full text link
    [EN] Exascale computing systems are being built with thousands of nodes. The high number of components of these systems significantly increases the probability of failure. A key component for them is the interconnection network. If failures occur in the interconnection network, they may isolate a large fraction of the machine. For this reason, an efficient fault-tolerant mechanism is needed to keep the system interconnected, even in the presence of faults. A recently proposed topology for these large systems is the hybrid k-ary n-direct s-indirect family that provides optimal performance and connectivity at a reduced hardware cost. This paper presents a fault-tolerant routing methodology for the k-ary n-direct s-indirect topology that degrades performance gracefully in presence of faults and tolerates a large number of faults without disabling any healthy computing node. In order to tolerate network failures, the methodology uses a simple mechanism. For any source-destination pair, if necessary, packets are forwarded to the destination node through a set of intermediate nodes (without being ejected from the network) with the aim of circumventing faults. The evaluation results shows that the proposed methodology tolerates a large number of faults. For instance, it is able to tolerate more than 99.5% of fault combinations when there are 10 faults in a 3-D network with 1000 nodes using only 1 intermediate node and more than 99.98% if 2 intermediate nodes are used. Furthermore, the methodology offers a gracious performance degradation. As an example, performance degrades only by 1% for a 2-D network with 1024 nodes and 1% faulty links.This work was supported by the Spanish Ministerio de Economía y Competitividad (MINECO), by FEDER funds under Grant TIN2015-66972-C5-1-R, by Programa de Ayudas de Investigación y Desarrollo (PAID) from Universitat Politècnica de alència and by the financial support of the FP7 HiPEAC Network of Excellence under grant agreement 287759Peñaranda Cebrián, R.; Gómez Requena, ME.; López Rodríguez, PJ.; Gran, EG.; Skeie, T. (2017). A fault-tolerant routing strategy for k-ary n-direct s-indirect topologies based on intermediate nodes. Concurrency and Computation Practice and Experience. 29(13):1-11. https://doi.org/10.1002/cpe.4065S111291

    Fault-tolerant interconnection networks for multiprocessor systems

    Get PDF
    Interconnection networks represent the backbone of multiprocessor systems. A failure in the network, therefore, could seriously degrade the system performance. For this reason, fault tolerance has been regarded as a major consideration in interconnection network design. This thesis presents two novel techniques to provide fault tolerance capabilities to three major networks: the Baseline network, the Benes network and the Clos network. First, the Simple Fault Tolerance Technique (SFT) is presented. The SFT technique is in fact the result of merging two widely known interconnection mechanisms: a normal interconnection network and a shared bus. This technique is most suitable for networks with small switches, such as the Baseline network and the Benes network. For the Clos network, whose switches may be large for the SFT, another technique is developed to produce the Fault-Tolerant Clos (FTC) network. In the FTC, one switch is added to each stage. The two techniques are described and thoroughly analyzed

    Low-Memory Techniques for Routing and Fault-Tolerance on the Fat-Tree Topology

    Full text link
    Actualmente, los clústeres de PCs están considerados como una alternativa eficiente a la hora de construir supercomputadores en los que miles de nodos de computación se conectan mediante una red de interconexión. La red de interconexión tiene que ser diseñada cuidadosamente, puesto que tiene una gran influencia sobre las prestaciones globales del sistema. Dos de los principales parámetros de diseño de las redes de interconexión son la topología y el encaminamiento. La topología define la interconexión de los elementos de la red entre sí, y entre éstos y los nodos de computación. Por su parte, el encaminamiento define los caminos que siguen los paquetes a través de la red. Las prestaciones han sido tradicionalmente la principal métrica a la hora de evaluar las redes de interconexión. Sin embargo, hoy en día hay que considerar dos métricas adicionales: el coste y la tolerancia a fallos. Las redes de interconexión además de escalar en prestaciones también deben hacerlo en coste. Es decir, no sólo tienen que mantener su productividad conforme aumenta el tamaño de la red, sino que tienen que hacerlo sin incrementar sobremanera su coste. Por otra parte, conforme se incrementa el número de nodos en las máquinas de tipo clúster, la red de interconexión debe crecer en concordancia. Este incremento en el número de elementos de la red de interconexión aumenta la probabilidad de aparición de fallos, y por lo tanto, la tolerancia a fallos es prácticamente obligatoria para las redes de interconexión actuales. Esta tesis se centra en la topología fat-tree, ya que es una de las topologías más comúnmente usadas en los clústeres. El objetivo de esta tesis es aprovechar sus características particulares para proporcionar tolerancia a fallos y un algoritmo de encaminamiento capaz de equilibrar la carga de la red proporcionando una buena solución de compromiso entre las prestaciones y el coste.Gómez Requena, C. (2010). Low-Memory Techniques for Routing and Fault-Tolerance on the Fat-Tree Topology [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8856Palanci

    Efficient mechanisms to provide fault tolerance in interconnection networks for pc clusters

    Full text link
    Actualmente, los clusters de PC son un alternativa rentable a los computadores paralelos. En estos sistemas, miles de componentes (procesadores y/o discos duros) se conectan a través de redes de interconexión de altas prestaciones. Entre las tecnologías de red actualmente disponibles para construir clusters, InfiniBand (IBA) ha emergido como un nuevo estándar de interconexión para clusters. De hecho, ha sido adoptado por muchos de los sistemas más potentes construidos actualmente (lista top500). A medida que el número de nodos aumenta en estos sistemas, la red de interconexión también crece. Junto con el aumento del número de componentes la probabilidad de averías aumenta dramáticamente, y así, la tolerancia a fallos en el sistema en general, y de la red de interconexión en particular, se convierte en una necesidad. Desafortunadamente, la mayor parte de las estrategias de encaminamiento tolerantes a fallos propuestas para los computadores masivamente paralelos no pueden ser aplicadas porque el encaminamiento y las transiciones de canal virtual son deterministas en IBA, lo que impide que los paquetes eviten los fallos. Por lo tanto, son necesarias nuevas estrategias para tolerar fallos. Por ello, esta tesis se centra en proporcionar los niveles adecuados de tolerancia a fallos a los clusters de PC, y en particular a las redes IBA. En esta tesis proponemos y evaluamos varios mecanismos adecuados para las redes de interconexión para clusters. El primer mecanismo para proporcionar tolerancia a fallos en IBA (al que nos referimos como encaminamiento tolerante a fallos basado en transiciones; TFTR) consiste en usar varias rutas disjuntas entre cada par de nodos origen-destino y seleccionar la ruta apropiada en el nodo fuente usando el mecanismo APM proporcionado por IBA. Consiste en migrar las rutas afectadas por el fallo a las rutas alternativas sin fallos. Sin embargo, con este fin, es necesario un algoritmo eficiente de encaminamiento capaz de proporcionar suficientesMontañana Aliaga, JM. (2008). Efficient mechanisms to provide fault tolerance in interconnection networks for pc clusters [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/2603Palanci
    corecore