
    Files as first-class objects in fault-tolerant concurrent systems

    Concurrent systems are used in applications where multiple processors are needed to complete tasks within a reasonable amount of time, or where the data sets involved will not fit within the main memory of a single computer. Because of their reliance on multiple machines, such systems are proportionally more vulnerable to both hardware- and software-induced failures. Fault-tolerance schemes are used to recover some earlier consistent state of the system after such a failure. One important technique used to achieve fault tolerance is checkpointing and rollback-recovery. In this thesis, we present a method for efficiently and transparently incorporating the part of the process state contained in the file system into process checkpoints, and we show how consistent versions of the file system and processes may be recovered after a failure. We present the details of a prototype system which implements our method. We show that by using the special properties of the log-structured file system, the class of programs amenable to checkpointing and rollback-recovery schemes can be expanded to include those that use files. We impose no a priori restriction on the types of file system operations that can be performed, and we demonstrate that our scheme does not impose significant failure-free overhead on the computation.
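
    To make the checkpoint-and-rollback idea concrete, the following is a minimal, hypothetical Python sketch (not the thesis's prototype): a process checkpoint records the in-memory state together with a version marker for each file the process has written, so that rollback restores a mutually consistent pair. The VersionedFile class and its version counter are illustrative stand-ins for the versioning that a log-structured file system provides.

    import copy

    class VersionedFile:
        """Illustrative stand-in for a file in a versioning (log-structured) file system."""
        def __init__(self):
            self.versions = [b""]                      # version 0 is the empty file

        def append(self, data: bytes):
            self.versions.append(self.versions[-1] + data)   # every write creates a new version

        def current_version(self) -> int:
            return len(self.versions) - 1

        def rollback_to(self, version: int):
            del self.versions[version + 1:]            # discard versions newer than the checkpoint

    class Process:
        """Toy process whose state is an in-memory dict plus the files it writes."""
        def __init__(self, files):
            self.memory = {}
            self.files = files

        def checkpoint(self):
            # A consistent checkpoint records memory *and* the version of every file.
            return (copy.deepcopy(self.memory),
                    {name: f.current_version() for name, f in self.files.items()})

        def rollback(self, chk):
            memory, file_versions = chk
            self.memory = copy.deepcopy(memory)
            for name, version in file_versions.items():
                self.files[name].rollback_to(version)

    # Usage: checkpoint, do more work, "fail", and recover a consistent state.
    files = {"log": VersionedFile()}
    p = Process(files)
    p.memory["step"] = 1
    files["log"].append(b"step 1\n")
    chk = p.checkpoint()
    p.memory["step"] = 2
    files["log"].append(b"step 2\n")                   # work lost to the simulated failure
    p.rollback(chk)                                    # memory and file contents agree again
    assert p.memory["step"] == 1 and files["log"].versions[-1] == b"step 1\n"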

    Enabling system survival across hypervisor failures

    Master's dissertation in Industrial Electronics and Computers Engineering. The evolution of embedded systems is notable: as their complexity grows, these systems take on general-purpose behaviour rather than their original single-purpose features. Naturally, virtualization started to impact this field. This technology decreases hardware costs, since it allows several software components to run on the same hardware. Although virtualization began as a pure software layer, many companies started to provide hardware support for it. Although ARM TrustZone is a security extension, many developers realized that it could be used to support the development of hypervisors. With TrustZone, hypervisors can ensure one of the most important properties of virtualization: isolation between guests. However, this hardware technology has revealed some vulnerabilities, and since the whole system depends on TrustZone, the virtualization layer can be compromised. To address this problem, this thesis proposes a hybrid software/hardware mechanism to handle failures of TrustZone-based hypervisors. By using the processor's abort exceptions and hash keys, it detects system malfunctions caused by imperfect designs or even deliberate attacks. Additionally, it provides a checkpoint-based restoration model that allows system recovery without major setbacks. The implemented solution was deployed on TrustZone-based LTZVisor, an open-source, in-house hypervisor, and the results are promising. With a 6.5% increase in memory footprint and, in the worst case, a 23% increase in context-switching time, it is possible to detect secure-memory invasions and recover the system. Despite the added memory footprint and latency, the reliability and availability this mechanism brings to LTZVisor are unquestionable.
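
    The detection-and-recovery loop described above can be illustrated with a small, hypothetical Python sketch (not the thesis's TrustZone mechanism): a monitor keeps hash keys over memory regions that should never change, recomputes them on demand, and rolls corrupted regions back to the last checkpoint. The region names and helper methods are invented for the example.

    import hashlib

    class IntegrityMonitor:
        """Toy model: hash keys over protected memory regions plus checkpoint-based restoration."""
        def __init__(self, regions):
            self.regions = dict(regions)                       # region name -> current contents
            self.checkpoint = dict(regions)                    # last known-good copy
            self.hashes = {n: self._hash(d) for n, d in regions.items()}

        @staticmethod
        def _hash(data: bytes) -> str:
            return hashlib.sha256(data).hexdigest()

        def take_checkpoint(self):
            self.checkpoint = dict(self.regions)
            self.hashes = {n: self._hash(d) for n, d in self.regions.items()}

        def detect(self):
            # Recompute hash keys; any mismatch marks a corrupted region.
            return [n for n, d in self.regions.items() if self._hash(d) != self.hashes[n]]

        def recover(self):
            corrupted = self.detect()
            for name in corrupted:
                self.regions[name] = self.checkpoint[name]     # roll back to the checkpointed contents
            return corrupted

    # Usage: corrupt a protected region (as a fault or attack would) and recover it.
    mon = IntegrityMonitor({"vector_table": b"\x00" * 64, "hypervisor_data": b"\x01" * 64})
    mon.regions["hypervisor_data"] = b"\xff" * 64              # simulated invasion of secure memory
    assert mon.recover() == ["hypervisor_data"]
    assert mon.regions["hypervisor_data"] == b"\x01" * 64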

    Self-adaptivity of applications on network on chip multiprocessors: the case of fault-tolerant Kahn process networks

    Technology scaling, accompanied by higher operating frequencies and the ability to integrate more functionality in the same chip, has been the driving force behind delivering higher-performance computing systems at lower costs. Embedded computing systems, which have been riding the same wave of success, have evolved into complex architectures encompassing a high number of cores interconnected by an on-chip network (usually identified as a Multiprocessor System-on-Chip). However, these trends are hindered by issues that arise as technology scaling continues towards deep submicron scales. Firstly, the growing complexity of these systems and the variability introduced by process technologies make it ever harder to perform a thorough optimization of the system at design time. Secondly, designers are faced with a reliability wall that emerges as age-related degradation reduces the lifetime of transistors, and as the probability of defects escaping post-manufacturing testing increases. In this thesis, we take on these challenges within the context of streaming applications running on network-on-chip-based parallel (not necessarily homogeneous) systems-on-chip that adopt the no-remote-memory-access model. In particular, this thesis tackles two main problems: (1) fault-aware online task remapping, and (2) application-level self-adaptation for quality management. For the former, by viewing fault tolerance as a self-adaptation aspect, we adopt a cross-layer approach that aims at graceful performance degradation by addressing permanent faults in processing elements mostly at system level, in particular by exploiting the redundancy available in multi-core platforms. We propose an optimal solution based on an integer linear programming formulation (suitable for design-time adoption) as well as heuristic-based solutions to be used at run-time, and we assess the impact of our approach on lifetime reliability. We propose two recovery schemes based on a checkpoint-and-rollback and a roll-forward technique. For the latter, we propose two variants of a monitor-controller-adapter loop that adapts application-level parameters to meet performance goals. We demonstrate not only that fault tolerance and self-adaptivity can be achieved in embedded platforms, but also that it can be done without incurring large overheads. In addressing these problems, we present techniques which have been realized (depending on their characteristics) in the form of a design tool, a run-time library or a hardware core to be added to the basic architecture.
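
    As a concrete illustration of fault-aware task remapping, the hypothetical Python snippet below applies a greedy heuristic (a sketch only, not the thesis's ILP formulation or its run-time heuristics): the tasks of a failed processing element are moved, heaviest first, onto the least-loaded surviving elements.

    def remap_after_failure(mapping, loads, task_cost, failed_pe):
        """Greedy remapping: move each task of the failed PE to the least-loaded surviving PE.

        mapping   : task -> processing element (PE) currently running it
        loads     : PE   -> current utilization
        task_cost : task -> utilization the task adds to whichever PE runs it
        """
        survivors = {pe: load for pe, load in loads.items() if pe != failed_pe}
        new_mapping = dict(mapping)
        # Remap heavier tasks first so they land on the emptiest PEs.
        orphans = sorted((t for t, pe in mapping.items() if pe == failed_pe),
                         key=lambda t: task_cost[t], reverse=True)
        for task in orphans:
            target = min(survivors, key=survivors.get)         # least-loaded surviving PE
            new_mapping[task] = target
            survivors[target] += task_cost[task]
        return new_mapping

    # Usage: processing element "pe1" fails and its tasks are spread over the survivors.
    mapping = {"src": "pe0", "filter": "pe1", "fft": "pe1", "sink": "pe2"}
    loads = {"pe0": 0.5, "pe1": 0.7, "pe2": 0.3}
    costs = {"src": 0.5, "filter": 0.4, "fft": 0.3, "sink": 0.3}
    print(remap_after_failure(mapping, loads, costs, failed_pe="pe1"))
    # {'src': 'pe0', 'filter': 'pe2', 'fft': 'pe0', 'sink': 'pe2'}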

    Adding persistence to main memory programming

    Unlocking the true potential of the new persistent memories (PMEMs) requires eliminating traditional persistent I/O abstractions altogether, by introducing persistent semantics directly into main memory programming. Such a programming model elevates failure atomicity to a first-class application property, in addition to in-memory data layout, concurrency control, and fault tolerance, and therefore requires a redesign of programming abstractions for both program correctness and maximum performance gains. To address these challenges, this thesis proposes a set of system software designs that integrate persistence with main memory programming, and makes the following contributions. First, this thesis proposes a PMEM-aware I/O runtime, NVStream, that supports fast durable streaming I/O. NVStream uses a memory-based I/O interface that integrates with an application's existing I/O data movement operations to accelerate persistent data writes. NVStream carefully designs its persistent data storage layout and crash-consistent semantics to match both application and PMEM characteristics. Specifically, we leverage the streaming nature of I/O in HPC workflows to benefit from a log-structured PMEM storage engine design that uses relaxed write orderings and append-only failure-atomic semantics to form strongly consistent application checkpoints. Furthermore, we identify that optimizing the I/O software stack exposes the PMEM bandwidth limitations as a bottleneck during parallel HPC I/O writes, and propose a novel data movement design – PHX. PHX uses alternative network data movement paths available in datacenters to ease the bandwidth pressure on the PMEM memory interconnects, all while maintaining the correctness of the persistent data. Next, the thesis explores the challenges and opportunities of using PMEM for true main-memory persistent programming – a single data domain for both runtime and persistent application state. Such a programming model includes maintaining ACID properties during each and every update to the application's persistent structures. ACID-qualified persistent programming for multi-threaded applications is hard, as the programmer has to reason about both crash-consistency and synchronization – crash-sync – semantics for programming correctness. The thesis contributes new understanding of the correctness requirements for mixing different crash-consistency and synchronization protocols, characterizes the performance of different crash-sync realizations for different applications and hardware architectures, and draws actionable insights for future designs of PMEM systems. Finally, the application state stored on node-local persistent memory is still vulnerable to catastrophic node failures. The thesis proposes a replicated persistent memory runtime, Blizzard, that supports truly fault-tolerant, concurrent and persistent data-structure programming. Blizzard carefully integrates userspace networking with byte-addressable PMEM for a fast, persistent memory replication runtime. The design also incorporates a replication-aware crash-sync protocol that supports consistent and concurrent updates on persistent data structures. Blizzard offers applications the flexibility to use the data structures that best match their functional requirements, while offering better performance and providing crucial reliability guarantees lacking from existing persistent memory runtimes.
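
    The append-only, failure-atomic logging idea behind this kind of storage engine can be sketched in plain Python (a conceptual illustration only: a regular file and os.fsync stand in for persistent memory and its flush-and-fence ordering, and the record format is invented for the example). Each record is written as length + checksum + payload, and recovery replays only records whose checksum verifies, so a write torn by a crash is simply discarded.

    import os, struct, zlib

    HEADER = struct.Struct("<II")                  # payload length, CRC32 of payload

    def append_record(path, payload: bytes):
        """Append length + checksum + payload, then flush; the record is the unit of atomicity."""
        with open(path, "ab") as f:
            f.write(HEADER.pack(len(payload), zlib.crc32(payload)) + payload)
            f.flush()
            os.fsync(f.fileno())                   # stand-in for a persistence barrier (flush + fence on real PMEM)

    def recover(path):
        """Replay only records whose checksum verifies; a torn tail write is dropped."""
        records = []
        try:
            with open(path, "rb") as f:
                data = f.read()
        except FileNotFoundError:
            return records
        off = 0
        while off + HEADER.size <= len(data):
            length, crc = HEADER.unpack_from(data, off)
            payload = data[off + HEADER.size : off + HEADER.size + length]
            if len(payload) < length or zlib.crc32(payload) != crc:
                break                              # incomplete or corrupted tail: stop replay here
            records.append(payload)
            off += HEADER.size + length
        return records

    # Usage: two complete records survive; a simulated torn write at the tail is ignored.
    log = "stream.log"
    append_record(log, b"checkpoint:step=1")
    append_record(log, b"checkpoint:step=2")
    with open(log, "ab") as f:
        f.write(b"\x40\x00\x00")                   # crash mid-record: only a fragment of the next header
    print(recover(log))                            # [b'checkpoint:step=1', b'checkpoint:step=2']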

    Payload operations control center network (POCCNET) systems definition phase study report

    The results of the studies performed during the systems definition phase of POCCNET are presented. The concept of POCCNET as a system of standard POCCs is described, and an analysis of system requirements is also included. Alternative system concepts were evaluated, as well as various methods for the development of reliable, reusable software. A number of POCC application areas, such as command management, on-board computer support, and simulation, were also studied. Other areas of investigation included the operation of POCCNET systems, the facility requirements and usage.

    Integrated Software Architecture-Based Reliability Prediction for IT Systems

    With the increasing importance of reliability in business and industrial IT systems, new techniques for architecture-based software reliability prediction are becoming an integral part of the development process. This dissertation thesis introduces a novel reliability modelling and prediction technique that considers the software architecture with its component structure, control and data flow, recovery mechanisms, its deployment to distributed hardware resources and the system's usage profile.
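
    Architecture-based reliability prediction can be illustrated with a small, hypothetical Python sketch (not the technique introduced in the thesis): the probability that a whole run succeeds is composed from per-component failure probabilities, weighted by how often the usage profile exercises each component along the control flow.

    def system_reliability(expected_calls, per_call_reliability):
        """Usage-profile-weighted composition under an independence assumption.

        expected_calls       : component -> expected number of invocations per system run
                               (derived from control flow and the usage profile)
        per_call_reliability : component -> probability that a single invocation succeeds
        Returns the probability that a whole run completes without failure.
        """
        reliability = 1.0
        for component, calls in expected_calls.items():
            reliability *= per_call_reliability[component] ** calls
        return reliability

    # Usage: a front end hit once per run, a data store hit 3.2 times per run on average.
    profile = {"frontend": 1.0, "business_logic": 2.5, "data_store": 3.2}
    per_call = {"frontend": 0.9999, "business_logic": 0.9995, "data_store": 0.9990}
    print(f"P(run succeeds) = {system_reliability(profile, per_call):.4f}")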

    Aspect-oriented technology for dependable operating systems

    Modern computer devices exhibit transient hardware faults that disturb the electrical behavior but do not cause permanent physical damage to the devices. Transient faults are caused by a multitude of sources, such as fluctuations of the supply voltage, electromagnetic interference, and radiation from the natural environment. Therefore, dependable computer systems must incorporate methods of fault tolerance to cope with transient faults. Software-implemented fault tolerance represents a promising approach that does not need expensive hardware redundancy for reducing the probability of failure to an acceptable level. This thesis focuses on software-implemented fault tolerance for operating systems because they are the most critical pieces of software in a computer system: all computer programs depend on the integrity of the operating system. However, the C/C++ source code of common operating systems tends to be exceedingly complex already, so that manually extending it with fault tolerance is not a viable solution. Thus, this thesis proposes a generic solution based on Aspect-Oriented Programming (AOP). To evaluate AOP as a means to improve the dependability of operating systems, this thesis presents the design and implementation of a library of aspect-oriented fault-tolerance mechanisms. These mechanisms constitute separate program modules that can be integrated automatically into common off-the-shelf operating systems using a compiler for the AOP language. Thus, the aspect-oriented approach facilitates improving the dependability of large-scale software systems without affecting the maintainability of the source code. The library allows choosing between several error-detection and error-correction schemes, and provides wait-free synchronization for handling asynchronous and multi-threaded operating-system code. This thesis evaluates the aspect-oriented approach to fault tolerance on the basis of two off-the-shelf operating systems. Furthermore, the evaluation also considers one user-level program for protection, as the library of fault-tolerance mechanisms is highly generic and transparent and, thus, not limited to operating systems. Exhaustive fault-injection experiments show an excellent trade-off between runtime overhead and fault tolerance, which can be adjusted and optimized by fine-grained selective placement of the fault-tolerance mechanisms. Finally, this thesis provides evidence for the effectiveness of the approach in detecting and correcting radiation-induced hardware faults: high-energy particle radiation experiments confirm an improvement in fault tolerance of almost 80 percent.
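
    The effect of weaving fault-tolerance advice around existing code can be approximated in Python with a decorator, shown below as a loose analogy only; the thesis's mechanisms are woven into C/C++ operating-system code by an AOP compiler and are not written in Python. The sketch adds redundant execution as an error-detection scheme without touching the protected function's own source, which is the property the aspect-oriented approach provides.

    import functools

    def redundant_execution(func):
        """Aspect-like advice: run the computation twice and compare, detecting transient faults."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            first = func(*args, **kwargs)
            second = func(*args, **kwargs)
            if first != second:
                raise RuntimeError(f"transient fault detected in {func.__name__}")
            return first
        return wrapper

    @redundant_execution                           # applied without touching the function's own code
    def scale(values, factor):
        return [v * factor for v in values]

    print(scale([1, 2, 3], 10))                    # [10, 20, 30]

    An error-correction variant could execute the call three times and take a majority vote, mirroring the choice between error-detection and error-correction schemes that the library offers.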