25 research outputs found

    Fault-free longest paths in star networks with conditional link faults

    Get PDF
    AbstractThe star network, which belongs to the class of Cayley graphs, is one of the most versatile interconnection networks for parallel and distributed computing. In this paper, adopting the conditional fault model in which each node is assumed to be incident with two or more fault-free links, we show that an n-dimensional star network can tolerate up to 2n−7 link faults, and be strongly (fault-free) Hamiltonian laceable, where n≥4. In other words, we can embed a fault-free linear array of length n!−1 (n!−2) in an n-dimensional star network with up to 2n−7 link faults, if the two end nodes belong to different partite sets (the same partite set). The result is optimal with respect to the number of link faults tolerated. It is already known that under the random fault model, an n-dimensional star network can tolerate up to n−3 faulty links and be strongly Hamiltonian laceable, for n≥3

    Fault-tolerance embedding of rings and arrays in star and pancake graphs

    Full text link
    The star and pancake graphs are useful interconnection networks for connecting processors in a parallel and distributed computing environment. The star network has been widely studied and is shown to possess attactive features like sublogarithmic diameter, node and edge symmetry and high resilience. The star/pancake interconnection graphs, {dollar}S\sb{n}/P\sb{n}{dollar} of dimension n have n! nodes connected by {dollar}{(n-1).n!\over2}{dollar} edges. Due to their large number of nodes and interconnections, they are prone to failure of one or more nodes/edges; In this thesis, we present methods to embed Hamiltonian paths (H-path) and Hamiltonian cycles (H-cycle) in a star graph {dollar}S\sb{n}{dollar} and pancake graph {dollar}P\sb{n}{dollar} in a faulty environment. Such embeddings are important for solving computational problems, formulated for array and ring topologies, on star and pancake graphs. The models considered include single-processor failure, double-processor failure, and multiple-processor failures. All the models are applied to an H-cycle which is formed by visiting all the ({dollar}{n!\over4!})\ S\sb4/P\sb4{dollar}s in an {dollar}S\sb{n}/P\sb{n}{dollar} in a particular order. Each {dollar}S\sb4/P\sb4{dollar} has an entry node where the cycle/path enters that particular {dollar}S\sb4/P\sb4{dollar} and an exit node where the path leaves it. Distributed algorithms for embedding hamiltonian cycle in the presence of multiple faults, are also presented for both {dollar}S\sb{n}{dollar} and {dollar}P\sb{n}{dollar}

    Efficient mechanisms to provide fault tolerance in interconnection networks for pc clusters

    Full text link
    Actualmente, los clusters de PC son un alternativa rentable a los computadores paralelos. En estos sistemas, miles de componentes (procesadores y/o discos duros) se conectan a través de redes de interconexión de altas prestaciones. Entre las tecnologías de red actualmente disponibles para construir clusters, InfiniBand (IBA) ha emergido como un nuevo estándar de interconexión para clusters. De hecho, ha sido adoptado por muchos de los sistemas más potentes construidos actualmente (lista top500). A medida que el número de nodos aumenta en estos sistemas, la red de interconexión también crece. Junto con el aumento del número de componentes la probabilidad de averías aumenta dramáticamente, y así, la tolerancia a fallos en el sistema en general, y de la red de interconexión en particular, se convierte en una necesidad. Desafortunadamente, la mayor parte de las estrategias de encaminamiento tolerantes a fallos propuestas para los computadores masivamente paralelos no pueden ser aplicadas porque el encaminamiento y las transiciones de canal virtual son deterministas en IBA, lo que impide que los paquetes eviten los fallos. Por lo tanto, son necesarias nuevas estrategias para tolerar fallos. Por ello, esta tesis se centra en proporcionar los niveles adecuados de tolerancia a fallos a los clusters de PC, y en particular a las redes IBA. En esta tesis proponemos y evaluamos varios mecanismos adecuados para las redes de interconexión para clusters. El primer mecanismo para proporcionar tolerancia a fallos en IBA (al que nos referimos como encaminamiento tolerante a fallos basado en transiciones; TFTR) consiste en usar varias rutas disjuntas entre cada par de nodos origen-destino y seleccionar la ruta apropiada en el nodo fuente usando el mecanismo APM proporcionado por IBA. Consiste en migrar las rutas afectadas por el fallo a las rutas alternativas sin fallos. Sin embargo, con este fin, es necesario un algoritmo eficiente de encaminamiento capaz de proporcionar suficientesMontañana Aliaga, JM. (2008). Efficient mechanisms to provide fault tolerance in interconnection networks for pc clusters [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/2603Palanci

    Routing on the Channel Dependency Graph:: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks

    Get PDF
    In the pursuit for ever-increasing compute power, and with Moore's law slowly coming to an end, high-performance computing started to scale-out to larger systems. Alongside the increasing system size, the interconnection network is growing to accommodate and connect tens of thousands of compute nodes. These networks have a large influence on total cost, application performance, energy consumption, and overall system efficiency of the supercomputer. Unfortunately, state-of-the-art routing algorithms, which define the packet paths through the network, do not utilize this important resource efficiently. Topology-aware routing algorithms become increasingly inapplicable, due to irregular topologies, which either are irregular by design, or most often a result of hardware failures. Exchanging faulty network components potentially requires whole system downtime further increasing the cost of the failure. This management approach becomes more and more impractical due to the scale of today's networks and the accompanying steady decrease of the mean time between failures. Alternative methods of operating and maintaining these high-performance interconnects, both in terms of hardware- and software-management, are necessary to mitigate negative effects experienced by scientific applications executed on the supercomputer. However, existing topology-agnostic routing algorithms either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables. Using the fail-in-place strategy, a well-established method for storage systems to repair only critical component failures, is a feasible solution for current and future HPC interconnects as well as other large-scale installations such as data center networks. Although, an appropriate combination of topology and routing algorithm is required to minimize the throughput degradation for the entire system. This thesis contributes a network simulation toolchain to facilitate the process of finding a suitable combination, either during system design or while it is in operation. On top of this foundation, a key contribution is a novel scheduling-aware routing, which reduces fault-induced throughput degradation while improving overall network utilization. The scheduling-aware routing performs frequent property preserving routing updates to optimize the path balancing for simultaneously running batch jobs. The increased deployment of lossless interconnection networks, in conjunction with fail-in-place modes of operation and topology-agnostic, scheduling-aware routing algorithms, necessitates new solutions to solve the routing-deadlock problem. Therefore, this thesis further advances the state-of-the-art by introducing a novel concept of routing on the channel dependency graph, which allows the design of an universally applicable destination-based routing capable of optimizing the path balancing without exceeding a given number of virtual channels, which are a common hardware limitation. This disruptive innovation enables implicit deadlock-avoidance during path calculation, instead of solving both problems separately as all previous solutions
    corecore