14 research outputs found

    An Analysis of Failure Handling in Chameleon, A Framework for Supporting Cost-Effective Fault Tolerant Services

    The desire for low-cost reliable computing is increasing. Most current fault tolerant computing solutions are not very flexible, i.e., they cannot adapt to the reliability requirements of newly emerging applications in business, commerce, and manufacturing. It is important that users have a flexible, reliable platform to support both critical and noncritical applications. Chameleon, under development at the Center for Reliable and High-Performance Computing at the University of Illinois, is a software framework for supporting cost-effective, adaptable, networked fault tolerant services. This thesis details a simulation of fault injection, detection, and recovery in Chameleon. The simulation was written in C++ using the DEPEND simulation library. The results obtained from the simulation included the amount of overhead incurred by the fault detection and recovery mechanisms supported by Chameleon. In addition, the simulation yielded information about fault scenarios from which Chameleon cannot recover. The results showed that both critical and noncritical applications can be executed in the Chameleon environment with a fairly small amount of overhead. No single point of failure from which Chameleon could not recover was found. Chameleon was also found to be capable of recovering from several multiple-failure scenarios.

    Designing Efficient Network Interfaces For System Area Networks

    The network is the key component of a Cluster of Workstations/PCs. Its performance, measured in terms of bandwidth and latency, has a great impact on overall system performance. It quickly became clear that traditional WAN/LAN technology is not well suited for interconnecting powerful nodes into a cluster. Its poor performance too often slows down communication-intensive applications. This observation led to the birth of a new class of networks called System Area Networks (SAN). The ATOLL network introduces a new optimized architecture for SANs. On a single chip, not one but four network interfaces (NI) have been implemented, together with an on-chip 4x4 full-duplex switch and four link interfaces. This unique "Network on a Chip" architecture is best suited for interconnecting SMP nodes, where each of the multiple CPUs is given an exclusive NI and does not have to share a single interface. It also removes the need for any additional switching hardware, since the four byte-wide full-duplex links can be connected by cables to neighbor nodes in an arbitrary network topology.

    Chameleon: A Software Infrastructure and Testbed for Reliable High-Speed Networked Computing

    Coordinated Science Laboratory was formerly known as Control Systems Laboratory. NASA / NAG 1-61

    Performance Analysis of FRoots and Dimension-Order (Ytelseanalyse av FRoots og Dimension-Order)

    The purpose of this thesis is to carry out a performance analysis of two routing algorithms: FRoots and Dimension-Order. Dimension-Order is not actually the name of a specific routing algorithm, but rather the name of a category of routing algorithms that route in a particular way. FRoots is the more sophisticated routing algorithm, and when the thesis was assigned, the author believed he knew the outcome of the performance analysis. The algorithms were compared using a simulator (developed, incidentally, at the institution where I wrote the thesis). The simulator was built on J-Sim. A performance analysis requires a topology, which describes how a network of nodes and links is physically laid out. A 2D mesh in two sizes, 4x4 and 8x8, was used in order to see whether the size of the network matters for the algorithms. In addition, the simulations were run with different traffic patterns (uniform and pairwise). Under the uniform traffic pattern, a node can communicate with all other nodes during a simulation, while under the pairwise pattern a node can communicate with only one other node (that is, it cannot switch partners). In short, the analysis showed that FRoots performs best with the pairwise traffic pattern, while Dimension-Order routing is the better choice with uniform traffic. The size of the network has no bearing on the outcome.

    Microkernel mechanisms for improving the trustworthiness of commodity hardware

    The thesis presents microkernel-based software-implemented mechanisms for improving the trustworthiness of computer systems based on commercial off-the-shelf (COTS) hardware that can malfunction when the hardware is impacted by transient hardware faults. The hardware anomalies, if undetected, can cause data corruptions, system crashes, and security vulnerabilities, significantly undermining system dependability. Specifically, we adopt the single event upset (SEU) fault model and address transient CPU or memory faults. We take advantage of the functional correctness and isolation guarantee provided by the formally verified seL4 microkernel and hardware redundancy provided by multicore processors, design the redundant co-execution (RCoE) architecture that replicates a whole software system (including the microkernel) onto different CPU cores, and implement two variants, loosely-coupled redundant co-execution (LC-RCoE) and closely-coupled redundant co-execution (CC-RCoE), for the ARM and x86 architectures. RCoE treats each replica of the software system as a state machine and ensures that the replicas start from the same initial state, observe consistent inputs, perform equivalent state transitions, and thus produce consistent outputs during error-free executions. Compared with other software-based error detection approaches, the distinguishing feature of RCoE is that the microkernel and device drivers are also included in redundant co-execution, significantly extending the sphere of replication (SoR). Based on RCoE, we introduce two kernel mechanisms, fingerprint validation and kernel barrier timeout, detecting fault-induced execution divergences between the replicated systems, with the flexibility of tuning the error detection latency and coverage. The kernel error-masking mechanisms built on RCoE enable downgrading from triple modular redundancy (TMR) to dual modular redundancy (DMR) without service interruption. 
    We run synthetic benchmarks and system benchmarks to evaluate the performance overhead of the approach, observe that the overhead varies based on the characteristics of workloads and the variants (LC-RCoE or CC-RCoE), and conclude that the approach is applicable for real-world applications. The effectiveness of the error detection mechanisms is assessed by conducting fault injection campaigns on real hardware, and the results demonstrate compelling improvement.

    Design of Efficient Mechanisms for Managing InfiniBand Subnets (Diseño de mecanismos eficientes para la gestión de subredes infiniband)

    The main goal of this doctoral thesis is to contribute to the development of mechanisms for assimilating topology changes in the InfiniBand network architecture. In a first phase, an initial prototype of a management mechanism was designed and evaluated. Its evaluation allowed us to identify the main bottlenecks in the process of adapting to a change. Next, optimized mechanisms were proposed for each of the tasks involved in that process: detecting the topology change, acquiring the new network topology, computing new routes, and distributing updated routing tables to the network switches. The result is a management mechanism that is fully compatible with the InfiniBand specification, easily implementable in commercial systems, and almost transparent from the point of view of the applications served by the network.

    Conceptual Model and Architecture of MAFTIA

    This deliverable builds on the work reported in [MAFTIA 2000] and [Powell and Stroud 2001]. It contains a further refinement of the MAFTIA conceptual model and a revised discussion of the MAFTIA architecture. It also introduces the work done in MAFTIA on verification and assessment of security properties, which is reported on in more detail in [Adelsbach and Creese 2003].