14 research outputs found
An Analysis of Failure Handling in Chameleon, A Framework for Supporting Cost-Effective Fault Tolerant Services
The desire for low-cost reliable computing is increasing. Most current fault tolerant computing solutions are not very flexible, i.e., they cannot adapt to reliability requirements of newly emerging applications in business, commerce, and manufacturing. It is important that users have a flexible, reliable platform to support both critical and noncritical applications. Chameleon, under development at the Center for Reliable and High-Performance Computing at the University of Illinois, is a software framework for supporting cost-effective, adaptable, networked fault tolerant services. This thesis details a simulation of fault injection, detection, and recovery in Chameleon. The simulation was written in C++ using the DEPEND simulation library. The results obtained from the simulation included the amount of overhead incurred by the fault detection and recovery mechanisms supported by Chameleon. In addition, information was gained about fault scenarios from which Chameleon cannot recover. The results of the simulation showed that both critical and noncritical applications can be executed in the Chameleon environment with a fairly small amount of overhead. No single point of failure from which Chameleon could not recover was found. Chameleon was also found to be capable of recovering from several multiple failure scenarios.
Designing Efficient Network Interfaces For System Area Networks
The network is the key component of a Cluster of Workstations/PCs. Its performance, measured in terms of bandwidth and latency, has a great impact on the overall system performance. It quickly became clear that traditional WAN/LAN technology is not well suited for interconnecting powerful nodes into a cluster. Its poor performance too often slows down communication-intensive applications. This observation led to the birth of a new class of networks called System Area Networks (SAN). The ATOLL network introduces a new optimized architecture for SANs. On a single chip, not one but four network interfaces (NI) have been implemented, together with an on-chip 4x4 full-duplex switch and four link interfaces. This unique "Network on a Chip" architecture is best suited for interconnecting SMP nodes, where multiple CPUs are each given an exclusive NI and do not have to share a single interface. It also removes the need for any additional switching hardware, since the four byte-wide full-duplex links can be connected by cables with neighbor nodes in an arbitrary network topology.
Chameleon: A Software Infrastructure and Testbed for Reliable High-Speed Networked Computing
Coordinated Science Laboratory was formerly known as Control Systems Laboratory
NASA / NAG 1-61
Performance Analysis of FRoots and Dimension-Order
The purpose of this thesis is to carry out a performance analysis of two routing algorithms: FRoots and Dimension-Order. Dimension-Order is not actually the name of a specific routing algorithm, but rather the name of a category of routing algorithms that route in a particular way.
FRoots is the more sophisticated routing algorithm, and when the thesis was assigned, the author believed he knew the outcome of the performance analysis.
The algorithms were compared using a simulator (incidentally developed at the institution where I wrote the thesis). The simulator was built on J-Sim.
A performance analysis requires a topology. A topology tells us how a network of nodes and links is physically laid out.
A 2D mesh was used in two sizes: 4x4 and 8x8. This was done to see whether the size of the network matters for the algorithms. In addition, the simulations were run with different traffic patterns (uniform and pairwise). With the uniform traffic pattern, a node can communicate with all other nodes during a simulation, while with the pairwise pattern a node can communicate with only one node (that is, it cannot switch partner).
In short, the analysis showed that FRoots performs best with the pairwise traffic pattern, while Dimension-Order routing is the better choice with uniform traffic. The size of the network has no bearing on the outcome.
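As a rough illustration of what Dimension-Order routing means on a 2D mesh, the sketch below corrects the X coordinate first and then the Y coordinate (often called XY routing). It is not taken from the thesis or its J-Sim simulator; the function names `xy_route` and `pairwise_partner`, and the choice of partner in the pairwise pattern, are assumptions made only for this example.

```python
def xy_route(src, dst):
    """Return the nodes visited from src to dst on a 2D mesh.

    Dimension-order (XY) routing: move along X until the column
    matches, then move along Y until the row matches.
    """
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


def pairwise_partner(node, size):
    """Hypothetical pairwise pattern: each node always talks to the
    node mirrored through the centre of a size x size mesh."""
    x, y = node
    return (size - 1 - x, size - 1 - y)


if __name__ == "__main__":
    # Route across a 4x4 mesh from one corner to its fixed partner.
    src = (0, 0)
    dst = pairwise_partner(src, 4)
    print(xy_route(src, dst))
```

Because every packet between the same pair of nodes follows the same path, the route length always equals the Manhattan distance between source and destination.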
Microkernel mechanisms for improving the trustworthiness of commodity hardware
The thesis presents microkernel-based software-implemented mechanisms for improving the trustworthiness of computer systems based on commercial off-the-shelf (COTS) hardware that can malfunction when the hardware is impacted by transient hardware faults. The hardware anomalies, if undetected, can cause data corruptions, system crashes, and security vulnerabilities, significantly undermining system dependability. Specifically, we adopt the single event upset (SEU) fault model and address transient CPU or memory faults.
We take advantage of the functional correctness and isolation guarantee provided by the formally verified seL4 microkernel and hardware redundancy provided by multicore processors, design the redundant co-execution (RCoE) architecture that replicates a whole software system (including the microkernel) onto different CPU cores, and implement two variants, loosely-coupled redundant co-execution (LC-RCoE) and closely-coupled redundant co-execution (CC-RCoE), for the ARM and x86 architectures. RCoE treats each replica of the software system as a state machine and ensures that the replicas start from the same initial state, observe consistent inputs, perform equivalent state transitions, and thus produce consistent outputs during error-free executions. Compared with other software-based error detection approaches, the distinguishing feature of RCoE is that the microkernel and device drivers are also included in redundant co-execution, significantly extending the sphere of replication (SoR).
Based on RCoE, we introduce two kernel mechanisms, fingerprint validation and kernel barrier timeout, detecting fault-induced execution divergences between the replicated systems, with the flexibility of tuning the error detection latency and coverage. The kernel error-masking mechanisms built on RCoE enable downgrading from triple modular redundancy (TMR) to dual modular redundancy (DMR) without service interruption. We run synthetic benchmarks and system benchmarks to evaluate the performance overhead of the approach, observe that the overhead varies based on the characteristics of workloads and the variants (LC-RCoE or CC-RCoE), and conclude that the approach is applicable for real-world applications. The effectiveness of the error detection mechanisms is assessed by conducting fault injection campaigns on real hardware, and the results demonstrate compelling improvement
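The idea of comparing per-replica fingerprints, and of masking an error by dropping from three replicas to two, can be sketched generically as follows. This is only an illustration of majority voting over state checksums, not the seL4-based mechanism from the thesis; the use of CRC32 as the fingerprint and the names `fingerprint` and `validate` are assumptions for this example.

```python
import zlib
from collections import Counter


def fingerprint(state_bytes):
    """Assumed fingerprint: a CRC32 checksum over a replica's state."""
    return zlib.crc32(state_bytes)


def validate(fingerprints):
    """Compare replica fingerprints.

    fingerprints maps a replica id to its fingerprint.
    Returns (verdict, faulty_ids) where verdict is:
      "ok"       - all replicas agree
      "masked"   - a strict majority agrees, minority is reported faulty
      "detected" - divergence without a majority (e.g. two replicas)
    """
    counts = Counter(fingerprints.values())
    if len(counts) == 1:
        return "ok", []
    value, votes = counts.most_common(1)[0]
    if votes > len(fingerprints) / 2:
        faulty = [rid for rid, fp in fingerprints.items() if fp != value]
        return "masked", faulty
    return "detected", []


if __name__ == "__main__":
    replicas = {
        "r0": fingerprint(b"state-A"),
        "r1": fingerprint(b"state-A"),
        "r2": fingerprint(b"state-B"),  # simulated fault-induced divergence
    }
    verdict, faulty = validate(replicas)
    print(verdict, faulty)
    # Remove the faulty replica and continue with the remaining two.
    for rid in faulty:
        del replicas[rid]
    print(validate(replicas))
```

With three replicas a single divergent fingerprint can be outvoted and the faulty replica removed; with two replicas a mismatch can only be detected, since there is no majority to identify which replica is wrong.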
Design of efficient mechanisms for InfiniBand subnet management
The main objective of this doctoral thesis is to contribute to the development of mechanisms for assimilating topological changes in the InfiniBand network architecture.
In a first phase, an initial prototype of a management mechanism was designed and evaluated. Its evaluation allowed us to identify the main bottlenecks in the process of adapting to a change.
Next, optimized mechanisms were proposed for each of the tasks involved in that process: detecting the topological change, acquiring the new network topology, computing new routes, and distributing updated routing tables to the network switches.
The result is a management mechanism that is fully compatible with the InfiniBand specification, easy to implement in commercial systems, and almost transparent from the point of view of the applications served by the network.
Performance analysis and improvement of InfiniBand networks. Modelling and effective Quality-of-Service mechanisms for interconnection networks in cluster computing systems.
The InfiniBand Architecture (IBA) network has been proposed as a new
industrial standard with high-bandwidth and low-latency suitable for constructing
high-performance interconnected cluster computing systems. This architecture
replaces the traditional bus-based interconnection with a switch-based network for
the server Input-Output (I/O) and inter-processor communications. The efficient
Quality-of-Service (QoS) mechanism is fundamental to ensure the important QoS
metrics, such as maximum throughput and minimum latency, leaving aside other
aspects like guarantee to reduce the delay, blocking probability, and mean queue
length, etc.
Performance modelling and analysis has been and continues to be of great
theoretical and practical importance in the design and development of
communication networks. This thesis aims to investigate efficient and cost-effective
QoS mechanisms for performance analysis and improvement of InfiniBand
networks in cluster-based computing systems.
Firstly, a rate-based source-response link-by-link admission and congestion
control function with improved Explicit Congestion Notification (ECN) packet
marking scheme is developed. This function adopts the rate control to reduce
congestion of multiple-class traffic. Secondly, a credit-based flow control scheme is
presented to reduce the mean queue length, throughput and response time of the system. In order to evaluate the performance of this scheme, a new queueing
network model is developed. Theoretical analysis and simulation experiments show
that these two schemes are quite effective and suitable for InfiniBand networks.
Finally, to obtain a thorough and deep understanding of the performance attributes
of InfiniBand Architecture network, two efficient threshold function flow control
mechanisms are proposed to enhance the QoS of InfiniBand networks; one is Entry
Threshold that sets the threshold for each entry in the arbitration table, and the other is
Arrival Job Threshold that sets the threshold based on the number of jobs in each
Virtual Lane. Furthermore, the principle of Maximum Entropy is adopted to analyse
these two new mechanisms with the Generalized Exponential (GE)-Type
distribution for modelling the inter-arrival times and service times of the input traffic.
Extensive simulation experiments are conducted to validate the accuracy of the
analytical models.
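For readers unfamiliar with credit-based flow control, the general principle can be sketched as follows: a sender may transmit only while it holds credits, and each credit corresponds to one free slot in the receiver's buffer. This is a generic illustration, not the scheme or queueing model developed in the thesis; the class name `CreditLink` and its methods are hypothetical.

```python
from collections import deque


class CreditLink:
    """Toy model of one link with credit-based flow control."""

    def __init__(self, buffer_slots):
        # The sender starts with one credit per free receiver slot.
        self.credits = buffer_slots
        self.rx_buffer = deque()

    def try_send(self, packet):
        """Send a packet if a credit is available; return success."""
        if self.credits == 0:
            return False  # sender must wait, receiver buffer is full
        self.credits -= 1
        self.rx_buffer.append(packet)
        return True

    def drain_one(self):
        """Receiver consumes one packet and returns a credit."""
        if not self.rx_buffer:
            return None
        packet = self.rx_buffer.popleft()
        self.credits += 1
        return packet


if __name__ == "__main__":
    link = CreditLink(buffer_slots=2)
    for name in ["p0", "p1", "p2"]:
        print(name, "sent" if link.try_send(name) else "blocked")
    print("drained", link.drain_one())
    print("p2", "sent" if link.try_send("p2") else "blocked")
```

Because the sender never transmits without a credit, the receiver buffer cannot overflow, so the queue length at the receiver is bounded by the number of credits issued.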
Conceptual Model and Architecture of MAFTIA
This deliverable builds on the work reported in [MAFTIA 2000] and [Powell and Stroud 2001]. It contains a further refinement of the MAFTIA conceptual model and a revised discussion of the MAFTIA architecture. It also introduces the work done in MAFTIA on verification and assessment of security properties, which is reported on in more detail in [Adelsbach and Creese 2003].