3 research outputs found

    Conditional Fault-Diameter of Torus Networks

    Políticas de encaminamiento tolerantes a fallos (Fault-tolerant routing policies)

    The intensive and continuous use of high-performance computers to execute computationally intensive applications, coupled with the large number of elements that make them up, dramatically increases the likelihood of failures during their operation. This work focuses on solving the problem of fault tolerance for high-speed interconnection networks by designing fault-tolerant routing policies. The goal is to handle a given number of link and node failures, considering their impact factors and probability of occurrence. To accomplish this, we take advantage of the redundancy of existing communication paths, starting from adaptive routing approaches that fulfil the four phases of fault tolerance: error detection, damage confinement and assessment, error recovery, and fault treatment and continuity of service. The experiments show a performance degradation of less than 5%. In the future, we will address the loss of information in transit.

    Doctor of Philosophy

    Over the last decade, cyber-physical systems (CPSs) have seen significant applications in many safety-critical areas, such as autonomous automotive systems, automatic pilot avionics, and wireless sensor networks. A CPS uses networked embedded computers to monitor and control physical processes. The motivating example for this dissertation is the use of a fault-tolerant routing protocol for a Network-on-Chip (NoC) architecture that connects electronic control units (ECUs) to regulate sensors and actuators in a vehicle. With a network allowing ECUs to communicate with each other, they can share processing power to improve performance. In addition, networked ECUs enable flexible mapping to physical processes (e.g., sensors, actuators), which increases resilience to ECU failures by reassigning physical processes to spare ECUs. For the on-chip routing protocol, the ability to tolerate network faults is important for hardware reconfiguration to maintain the normal operation of a system. Adding a fault-tolerance feature to a routing protocol, however, increases its design complexity, making it prone to many functional problems. Formal verification techniques are therefore needed to verify its correctness. This dissertation proposes a link-fault-tolerant, multiflit wormhole routing algorithm, together with its formal modeling and verification using two different methodologies. Improving on previously published fault-tolerant routing algorithms, a link-fault routing algorithm is proposed that relaxes their unrealistic node-fault assumptions while conservatively avoiding deadlock by dropping network packets where appropriate. This routing algorithm, together with its routing architecture, is then modeled in the process-algebra language LNT, and compositional verification techniques are used to verify its key functional properties. For comparison, it is also modeled in channel-level VHDL, which is compiled to labeled Petri nets (LPNs). Algorithms for a partial order reduction method on LPNs are given. An optimal result is obtained from heuristics that trace back on LPNs to find causally related enabled predecessor transitions. Key observations are made from the comparison between these two verification methodologies.