28 research outputs found

    Doctor of Philosophy

    Current scaling trends in transistor technology, in pursuit of larger component counts and improved power efficiency, are making hardware increasingly less reliable. Extreme transistor miniaturization makes it easier for a bit stored in a memory element to flip. Soft errors, transient bit-flips caused by alpha particles and cosmic rays striking memory elements, have therefore become one of the major impediments to system resilience as we move toward exascale computing. Soft errors that escape the hardware layer may silently corrupt a program's runtime data, producing silent data corruption in the output. Because soft errors are transient, they are also notoriously hard to trace back to their origins. Techniques to enhance system resilience therefore hinge on the availability of efficient error detectors with high detection rates, low false-positive rates, and low computational overhead. It is equally important to have a flexible infrastructure capable of simulating realistic soft-error models, so that newly developed error detectors can be evaluated effectively. In this work, we present a set of techniques for efficiently detecting soft errors that affect control flow, data, and structured address computations in an application. We evaluate the efficacy of the proposed techniques on a collection of benchmarks through fault-injection-driven studies. As an important requirement, we also introduce two new LLVM-based fault injectors, KULFI and VULFI, geared towards scalar and vector architectures, respectively. Through this work, we aim to contribute to the system resilience community by making our research tools (error detectors and fault injectors) publicly available.
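At its simplest, the soft-error model described above can be emulated in software by XOR-ing a randomly chosen bit of a stored value. The sketch below is an illustrative single-bit-flip injector, not the actual LLVM-level instrumentation that KULFI and VULFI perform:

```python
import random

def flip_random_bit(value: int, width: int = 32) -> int:
    """Flip one randomly chosen bit in a `width`-bit integer,
    emulating a single-event upset in a memory element."""
    bit = random.randrange(width)
    return value ^ (1 << bit)

corrupted = flip_random_bit(0x000000FF)
# The corrupted value differs from the original in exactly one bit.
assert bin(corrupted ^ 0x000000FF).count("1") == 1
```

Real injectors such as KULFI instead instrument the compiler's intermediate representation, so faults can be targeted at specific instructions, registers, or address computations rather than raw integers.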

    An enhanced index-based checkpointing algorithm for distributed systems

    Rollback-recovery in distributed systems is important for fault-tolerant computing. Without fault-tolerance mechanisms, an application must be restarted from scratch if a fault occurs in the middle of its execution, losing useful computation. To provide efficient rollback-recovery, it is important that index-based distributed checkpointing algorithms reduce the number of checkpoints while still guaranteeing the existence of consistent global checkpoints. Because inter-process communication induces dependencies among process states, asynchronous checkpointing may suffer from the domino effect; a consistent global checkpoint should therefore always be ensured, to bound the rollback distance. Quasi-synchronous checkpointing protocols achieve synchronization in a loose fashion, and index-based checkpointing is a typical quasi-synchronous mechanism. The algorithm proposed in this thesis follows a new strategy that updates the checkpoint interval dynamically, as opposed to the static interval used by the existing algorithms explained in the previous chapter. Whenever a process takes a forced checkpoint, due to the reception of a message whose sequence number is higher than the process's own, the checkpoint interval is either reset or the next basic checkpoint is skipped, depending on when the message was received. The simulation is built on SPIN, a tool for tracing logical design errors and checking the logical consistency of protocols and algorithms in distributed systems. Simulation results show that the proposed scheme reduces the number of induced forced checkpoints per message by 27-32% on average compared to the traditional strategies.
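The forced-checkpoint rule described above can be sketched as follows. The class and field names, and the specific reset-or-skip policy (based on how close the next basic checkpoint is), are illustrative assumptions rather than the thesis's exact protocol:

```python
class Process:
    """Minimal sketch of an index-based checkpointing process with a
    dynamically adjusted basic-checkpoint interval."""

    def __init__(self, interval: int):
        self.seq = 0                  # current checkpoint sequence number
        self.interval = interval      # basic checkpoint interval (time units)
        self.next_basic = interval    # time of next scheduled basic checkpoint
        self.forced = 0               # count of forced checkpoints taken

    def take_basic_checkpoint(self, now: int):
        self.seq += 1
        self.next_basic = now + self.interval

    def receive(self, msg_seq: int, now: int):
        if msg_seq > self.seq:
            # Message carries a higher index: take a forced checkpoint
            # and advance our sequence number to stay consistent.
            self.seq = msg_seq
            self.forced += 1
            if self.next_basic - now < self.interval // 2:
                # Close to the next basic checkpoint: skip it, since the
                # forced checkpoint just taken can stand in for it.
                self.next_basic += self.interval
            else:
                # Otherwise reset the interval, counting from now.
                self.next_basic = now + self.interval
```

A message whose sequence number is not higher than the receiver's triggers no forced checkpoint, which is exactly what the dynamic-interval strategy exploits to reduce the forced-checkpoint count.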

    Improving the Performance of Big Data Analytics Platforms by Task and I/O Granularity Adjustment

    Department of Computer Science and Engineering

    With the massive increase in the amount of semi-structured and unstructured web data, big data analytics platforms have emerged and started to evolve rapidly. Apache Hadoop was developed for batch processing of large datasets, and systems for interactive and general-purpose applications have been developed alongside NoSQL databases. Numerous efforts have been made to improve the performance of Hadoop and NoSQL databases, including the use of a new device class, non-volatile main memory (NVMM), for NoSQL databases. Nonetheless, their performance is still far from satisfactory due to inadequate granularity of tasks and I/O. In this dissertation, we present novel techniques to improve the performance of Apache Hadoop and NVMM-based LSM-trees by adjusting task and I/O granularity. First, we analyze YARN container overhead and present a dynamic input-split size adjustment scheme, which can logically combine multiple HDFS blocks and increase the input size of each container, enabling a single map wave and reducing the number of containers and their initialization overhead. Experimental results show that we can avoid recurring container overhead by selecting the right size for input splits and reducing the number of containers. Second, we present a novel HDFS block coalescing scheme that mitigates the YARN container overhead. Our assorted block coalescing scheme combines multiple HDFS blocks and creates large input splits of various sizes, reducing the number of containers and their initialization overhead. Our experimental study shows that block coalescing significantly reduces container overhead while achieving good load balancing and job-scheduling fairness, without impairing the degree of overlap between the map and reduce phases. Third, we discuss the design choices for using NVMM in the indexing structures of NoSQL databases and present ZipperDB, a key-value store that redesigns the LSM-tree for byte-addressable persistent memory.
    To benefit from the byte-addressability of persistent memory, ZipperDB employs byte-addressable persistent SkipLists and performs Zipper Compaction, a novel in-place compaction algorithm that merges two adjacent persistent SkipLists without compromising failure atomicity. Byte-addressable compaction helps mitigate the write amplification problem, which is known to be the root cause of the write stall problem in LSM-trees. Finally, we present ListDB, a write-optimized key-value store for NVMM that overcomes the gap between DRAM and NVMM write latencies and thereby resolves the write stall problem. ListDB consists of three novel techniques: (i) byte-addressable Index-Unified Logging, which incrementally converts write-ahead logs into SkipLists; (ii) Braided SkipList, a simple NUMA-aware SkipList that effectively reduces the NUMA effects of NVMM; and (iii) NUMA-aware Zipper Compaction. Using these three techniques, ListDB makes background flush and compaction fast enough to resolve the infamous write stall problem and shows 1.6x and 25x higher write throughput than PACTree and Intel Pmem-RocksDB, respectively.
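As a rough, volatile-memory analogue of what merging by relinking means, the sketch below merges two sorted linked lists purely by pointer manipulation, with no key-value copying. Real Zipper Compaction operates on multi-level persistent SkipLists and guarantees failure atomicity on NVMM, which this simplified sketch omits:

```python
class Node:
    """One key-value node of a sorted singly linked list (the bottom
    level of a SkipList, simplified)."""
    def __init__(self, key, value, nxt=None):
        self.key, self.value, self.next = key, value, nxt

def zipper_merge(upper, lower):
    """Merge two sorted lists into one sorted list by relinking
    pointers only -- no node is copied, so key-value data stays in
    place, which is the property that lets in-place compaction avoid
    write amplification."""
    dummy = tail = Node(None, None)
    while upper and lower:
        if upper.key <= lower.key:
            tail.next = upper
            upper = upper.next
        else:
            tail.next = lower
            lower = lower.next
        tail = tail.next
    tail.next = upper or lower   # append whichever list remains
    return dummy.next
```

Because only `next` pointers change, the amount of data written during the merge is proportional to the number of nodes, not to the size of the keys and values they hold.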

    Safe and automatic live update

    Tanenbaum, A.S. [Promotor]

    Holistic System Design for Deterministic Replay.

    Deterministic replay systems record and reproduce the execution of a hardware or software system. While it is well known how to replay uniprocessor systems, it is much harder to provide deterministic replay of shared memory multithreaded programs on multiprocessors because shared memory accesses add a high-frequency source of non-determinism. This thesis proposes efficient multiprocessor replay systems: Respec, Chimera, and Rosa. Respec is an operating-system-based replay system. Respec is based on the observation that most program executions are data-race-free and for programs with no data races it is sufficient to record program input and the happens-before order of synchronization operations for replay. Respec speculates that a program is data-race-free and supports rollback and recovery from misspeculation. For racy programs, Respec employs a cheap runtime check that compares system call outputs and memory/register states of recorded and replayed processes at a semi-regular interval. Chimera uses a sound static data race detector to find all potential data races and instrument pairs of potentially racing instructions to transform an arbitrary program to make it data-race-free. Then, Chimera records only the non-deterministic inputs and the order of synchronization operations for replay. However, existing static data race detectors generate excessive false warnings, leading to high recording overhead. Chimera resolves this problem by employing a combination of profiling, symbolic analysis, and dynamic checks that target the sources of imprecision in the static data race detector. Rosa is a processor-based ultra-low overhead (less than one percent) replay solution that requires very little hardware support as it essentially only needs a log of cache misses to reproduce a multiprocessor execution. Unlike previous hardware-assisted systems, Rosa does not record shared memory dependencies at all. 
    Instead, it infers them offline using a Satisfiability Modulo Theories (SMT) solver. Our offline analysis is capable of inferring interleavings that are legal under the Sequential Consistency (SC) and Total Store Order (TSO) memory models.

    Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/102374/1/dongyoon_1.pd
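The core of the record phase in data-race-free replay is logging the happens-before order of synchronization operations. The toy user-level sketch below wraps a lock so that every acquisition is stamped with a global order; Respec records the equivalent information at the operating-system layer, and all names here are illustrative:

```python
import threading

class RecordingLock:
    """Lock wrapper that logs the global order of acquisitions.
    Replaying the same order on the same inputs reproduces a
    data-race-free multithreaded execution."""
    _clock = 0
    _clock_guard = threading.Lock()
    log = []   # entries: (acquisition index, lock name, thread name)

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()

    def acquire(self):
        self._lock.acquire()
        # Stamp the acquisition while holding the target lock, so the
        # logged order matches the real happens-before order.
        with RecordingLock._clock_guard:
            RecordingLock._clock += 1
            RecordingLock.log.append(
                (RecordingLock._clock, self.name,
                 threading.current_thread().name))

    def release(self):
        self._lock.release()
```

During replay, a scheduler would block each thread's `acquire` until its logged index comes up, re-enforcing the recorded order instead of appending to the log.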

    Detecting and surviving intrusions: exploring new approaches to intrusion detection, recovery, and response

    Computing platforms, such as embedded systems or laptops, are built with layers of preventive security mechanisms to reduce the likelihood of attackers successfully compromising them. Nevertheless, given time and despite decades of improvements in preventive security, intrusions still happen. Systems should therefore expect intrusions to occur and be built to detect and survive them. Commodity Operating Systems (OSs) are deployed with intrusion detection solutions, but their ability to survive intrusions is limited. State-of-the-art approaches from industry and academia involve manual procedures, loss of availability, coarse-grained responses, or non-negligible performance overhead. Moreover, low-level components, such as the BIOS, are increasingly targeted by sophisticated attackers to implant stealthy and resilient malware. State-of-the-art solutions, however, mainly focus on boot-time integrity, leaving the runtime part of the BIOS, known as the System Management Mode (SMM), a prime target. This dissertation shows that we can build platforms that detect intrusions at the BIOS level and survive intrusions at the OS level. First, we demonstrate that intrusion survivability is a viable approach for commodity OSs, developing a new approach that addresses various limitations in the literature and evaluating its security and performance. Second, we develop a hardware-based approach that detects attacks at the BIOS level and demonstrate its feasibility with multiple detection methods.

    [Translated from the French abstract:] Computing systems, such as laptops or embedded systems, are built with layers of preventive security mechanisms to reduce the probability that an attacker compromises them. Nevertheless, despite decades of progress in this area, intrusions still occur. Consequently, we must assume that intrusions will happen and build our systems so that they can detect and survive them. Commodity operating systems are deployed with intrusion detection mechanisms, but their ability to survive an intrusion is limited. State-of-the-art solutions require manual procedures, incur losses of availability, or impose a high performance cost. Moreover, low-level components such as the BIOS are increasingly targeted by attackers seeking to implant stealthy, resilient malware. Although state-of-the-art solutions guarantee the integrity of these components at boot time, few address the security of the BIOS services executed within the System Management Mode (SMM). This manuscript shows that we can build systems capable of detecting intrusions at the BIOS level and surviving them at the operating-system level. First, we demonstrate that an intrusion-survivability approach is viable and practical for commodity operating systems. Then, we demonstrate that intrusions can be detected at the BIOS level with a hardware-based solution.

    Automated intrusion recovery for web applications

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2013. Cataloged from PDF version of thesis. Includes bibliographical references (pages 93-97).

    In this dissertation, we develop recovery techniques for web applications and demonstrate that automated recovery from intrusions and user mistakes is practical as well as effective. Web applications play a critical role in users' lives today, making them an attractive target for attackers. New vulnerabilities are routinely found in web application software, and even if the software is bug-free, administrators may make security mistakes such as misconfiguring permissions; these bugs and mistakes virtually guarantee that every application will eventually be compromised. To clean up after a successful attack, administrators need to find its entry point, track down its effects, and undo the attack's corruptions while preserving legitimate changes. Today this is all done manually, which results in days of wasted effort with no guarantee that all traces of the attack have been found or that no legitimate changes were lost. To address this problem, we propose that automated intrusion recovery should be an integral part of web application platforms. This work develops several ideas (retroactive patching, automated UI replay, dependency tracking, patch-based auditing, and distributed repair) that together recover from past attacks that exploited a vulnerability, by retroactively fixing the vulnerability and repairing the system state to make it appear as if the vulnerability never existed. Repair tracks down and reverts the effects of the attack on other users within the same application and on other applications, while preserving legitimate changes.
    Using the techniques resulting from these ideas, an administrator can easily recover from past attacks that exploited a bug using nothing more than a patch fixing the bug, with no manual effort on her part to find the attack or track its effects. The same techniques can also recover from attacks that exploit past configuration mistakes; the administrator only has to point out the past request that resulted in the mistake. We built three prototype systems, WARP, POIROT, and AIRE, to explore these ideas. Using these systems, we demonstrate that we can recover from challenging attacks in real distributed web applications with little or no changes to application source code; that recovery time is a fraction of the original execution time for attacks with a few affected requests; and that support for recovery adds modest runtime overhead during the application's normal operation.

    by Ramesh Chandra. Ph.D.
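The dependency-tracking idea can be sketched as a taint traversal over an action log: any action that reads an object written by an already-tainted action becomes tainted itself, and the tainted set is what a repair system must redo or undo. The log format and field names below are illustrative assumptions, far simpler than WARP's actual dependency graph (which also covers UI replay and time-travel database state):

```python
def tainted_actions(log, entry_point):
    """Return the ids of every action transitively influenced by the
    attacker's entry-point request, given an append-only log in
    execution order where each entry records the objects it read and
    wrote."""
    tainted_objs = set()   # objects written by tainted actions so far
    tainted = []
    for action in log:
        infected = (action["id"] == entry_point
                    or bool(tainted_objs & set(action["reads"])))
        if infected:
            tainted.append(action["id"])
            tainted_objs |= set(action["writes"])
    return tainted
```

In a real repair, the tainted actions would then be rolled back and re-executed against the patched application, while untainted actions (legitimate changes) are left untouched.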