
    Alpha Entanglement Codes: Practical Erasure Codes to Archive Data in Unreliable Environments

    Data centres that use consumer-grade disk drives and distributed peer-to-peer systems are unreliable environments to archive data without enough redundancy. Most redundancy schemes are not completely effective for providing high availability, durability and integrity in the long term. We propose alpha entanglement codes, a mechanism that creates a virtual layer of highly interconnected storage devices to propagate redundant information across a large-scale storage system. Our motivation is to design flexible and practical erasure codes with high fault-tolerance to improve data durability and availability even in catastrophic scenarios. By flexible and practical, we mean code settings that can be adapted to future requirements and practical implementations with reasonable trade-offs between security, resource usage and performance. The codes have three parameters. Alpha increases storage overhead linearly but increases the possible paths to recover data exponentially. Two other parameters increase fault-tolerance even further without the need for additional storage. As a result, an entangled storage system can provide high availability and durability and offer additional integrity: it is more difficult to modify data undetectably. We evaluate how several redundancy schemes perform in unreliable environments and show that alpha entanglement codes are flexible and practical codes. Remarkably, they excel at code locality; hence, they reduce repair costs and become less dependent on storage locations with poor availability. Our solution outperforms Reed-Solomon codes in many disaster recovery scenarios. Comment: The publication has 12 pages and 13 figures. This work was partially supported by Swiss National Science Foundation SNSF Doc.Mobility 162014. 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)

    Redundant disk arrays: Reliable, parallel secondary storage

    During the past decade, advances in processor and memory technology have given rise to increases in computational performance that far outstrip increases in the performance of secondary storage technology. Coupled with emerging small-disk technology, disk arrays provide the cost, volume, and capacity of current disk subsystems but, by leveraging parallelism, many times their performance. Unfortunately, arrays of small disks may have much higher failure rates than the single large disks they replace. Redundant arrays of inexpensive disks (RAID) use simple redundancy schemes to provide high data reliability. The data encoding, performance, and reliability of redundant disk arrays are investigated. Organizing redundant data into a disk array is treated as a coding problem. Among the alternatives examined, codes as simple as parity are shown to effectively correct single, self-identifying disk failures.
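
As a minimal sketch of the parity idea mentioned in this abstract (not the report's own implementation), the Python below stores one XOR parity block alongside several data blocks and rebuilds a single block whose failure is already known. The names `xor_blocks` and `recover`, and the sample block contents, are hypothetical and chosen only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover(surviving_blocks, parity):
    """Rebuild the one missing block from the survivors and the parity block."""
    return xor_blocks(surviving_blocks + [parity])


# Hypothetical array: three data "disks" plus one parity "disk".
data_disks = [b"disk", b"fail", b"safe"]
parity = xor_blocks(data_disks)

# The failure is self-identifying: we know that disk 1 is the one that is gone.
failed = 1
survivors = [block for i, block in enumerate(data_disks) if i != failed]
rebuilt = recover(survivors, parity)
```

Because XOR is its own inverse, XOR-ing the parity with every surviving block cancels their contributions and leaves the missing block. A single parity block corrects only one such known failure per stripe.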

    Artificial Gravity in Mars Orbit for Crew Acclimation

    NASA's current baseline plan for a crewed Mars mission anticipates a transit time of up to three hundred days in microgravity and 3-14 days on the Martian surface for gravity acclimation before the crew can safely perform their first Extra-Vehicular Activity (EVA). While there are multiple options for how initial surface operations will be performed, all current designs involve acclimation on the surface, and the impacts on the mission schedule, required supplies, and crew lander systems are significant. This paper proposes an alternative option utilizing artificial gravity, which offers benefits in terms of mission scope, mass savings, crew health, and long-term strategic vision. By moving the acclimation requirement to the orbiting habitat's existing systems, rather than adding redundant systems to the lander, the Mars Descent Vehicle (MDV) can be a much smaller, simpler, and lighter design. Rather than being designed to support crew for days, the lander would need to support them for mere hours. While ambitious, the concept of pre-acclimation in orbit can be not only safe and feasible, but done with fairly minimal changes to the planned architecture and overall mass requirements. The data used draws on decades of established research and demonstrates how this capability can be used not only for pre-acclimation, but also to support crew during early orbital-only missions, surface abort contingency scenarios, and return-to-orbit abort scenarios, and as an early proof of capability for the larger and more ambitious artificial gravity designs needed for extended exploration missions in the future.

    Index to 1981 NASA Tech Briefs, volume 6, numbers 1-4

    Short announcements of new technology derived from the R&D activities of NASA are presented. These briefs emphasize information considered likely to be transferable across industrial, regional, or disciplinary lines and are issued to encourage commercial application. This index for 1981 Tech Briefs contains abstracts and four indexes: subject, personal author, originating center, and Tech Brief Number. The following areas are covered: electronic components and circuits, electronic systems, physical sciences, materials, life sciences, mechanics, machinery, fabrication technology, and mathematics and information sciences.

    What broke where for distributed and parallel applications — a whodunit story

    Detection, diagnosis and mitigation of performance problems in today's large-scale distributed and parallel systems is a difficult task. These large distributed and parallel systems are composed of various complex software and hardware components. When the system experiences a performance or correctness problem, developers struggle to understand the root cause of the problem and fix it in a timely manner. In my thesis, I address these three components of the performance problems in computer systems. First, we focus on diagnosing performance problems in large-scale parallel applications running on supercomputers. We developed techniques to localize the performance problem for root-cause analysis. Parallel applications, most of which are complex scientific simulations running on supercomputers, can create up to millions of parallel tasks that run on different machines and communicate using the message passing paradigm. We developed a highly scalable and accurate automated debugging tool called PRODOMETER, which uses sophisticated algorithms in three steps. First, it creates a logical progress dependency graph of the tasks to highlight how the problem spread through the system and manifested as a system-wide performance issue. Second, it uses this logical progress dependency graph to identify the task where the problem originated. Finally, it pinpoints the code region corresponding to the origin of the bug. Second, we developed a tool-chain that can detect performance anomalies using machine-learning techniques and can achieve a very low false-positive rate. Our input-aware performance anomaly detection system consists of a scalable data collection framework to collect performance-related metrics from code regions of different granularity, an offline model creation and prediction-error characterization technique, and a threshold-based anomaly detection engine for production runs. 
Our system requires few training runs and can handle unknown inputs and parameter combinations by dynamically calibrating the anomaly detection threshold according to the characteristics of the input data and the characteristics of the prediction error of the models. Third, we developed a performance-problem mitigation scheme for erasure-coded distributed storage systems. Repair operations for failed blocks in an erasure-coded distributed storage system take a very long time in network-constrained data centers. The reason is that, during the repair operation, a large amount of data from multiple nodes is gathered into a single node and then a mathematical operation is performed to reconstruct the missing part. This process severely congests the links toward the destination where the newly recreated data is to be hosted. We proposed a novel distributed repair technique, called Partial-Parallel-Repair (PPR), that performs this reconstruction in parallel on multiple nodes and eliminates network bottlenecks and, as a result, greatly speeds up the repair process. Fourth, we study how, for a class of applications, performance can be improved (or performance problems can be mitigated) by selectively approximating some of the computations. For many applications, the main computation happens inside a loop that can be logically divided into a few temporal segments that we call phases. We found that while approximating the initial phases might severely degrade the quality of the results, approximating the computation in the later phases has a very small impact on the final quality of the result. Based on this observation, we developed an optimization framework that, for a given budget of quality loss, finds the best approximation settings for each phase in the execution.
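
The contrast this abstract draws between gathering everything on one node and reconstructing in parallel can be sketched as follows. This is an illustrative sketch, not the thesis's PPR implementation: plain XOR stands in for the erasure-code arithmetic (a Reed-Solomon-style code would presumably also scale each chunk by a decoding coefficient, which is omitted here), and the function names and the round-counting model are assumptions.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equally sized chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


def centralized_repair(chunks):
    """Gather every surviving chunk on one node and combine them there.

    Returns the rebuilt chunk and the number of chunks that must arrive
    at that single node.
    """
    result = chunks[0]
    for chunk in chunks[1:]:
        result = xor_bytes(result, chunk)
    return result, len(chunks)


def partial_parallel_repair(chunks):
    """Combine chunks pairwise, as if each pair were merged on a different node.

    Returns the rebuilt chunk and the number of transfer rounds; in each
    round a node receives at most one partial result.
    """
    partials = list(chunks)
    rounds = 0
    while len(partials) > 1:
        merged = [xor_bytes(partials[i], partials[i + 1])
                  for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2:  # an unpaired partial result waits for the next round
            merged.append(partials[-1])
        partials = merged
        rounds += 1
    return partials[0], rounds


# Hypothetical surviving chunks held by four different nodes.
surviving = [b"\x01\x02", b"\x04\x08", b"\x10\x20", b"\x40\x80"]
central_result, central_inbound = centralized_repair(surviving)
ppr_result, ppr_rounds = partial_parallel_repair(surviving)
```

Both functions produce the same chunk because XOR is associative. The difference in this toy model is where the traffic lands: the centralized version funnels all four chunks into one node, while the pairwise version finishes in two rounds with each link carrying one chunk per round.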

    A Concept of Operations for an Integrated Vehicle Health Assurance System

    This document describes a Concept of Operations (ConOps) for an Integrated Vehicle Health Assurance System (IVHAS). This ConOps is associated with the Maintain Vehicle Safety (MVS) between Major Inspections Technical Challenge in the Vehicle Systems Safety Technologies (VSST) Project within NASA's Aviation Safety Program. In particular, this document seeks to describe an integrated system concept for vehicle health assurance that integrates ground-based inspection and repair information with in-flight measurement data for airframe, propulsion, and avionics subsystems. The MVS Technical Challenge intends to maintain vehicle safety between major inspections by developing and demonstrating new integrated health management and failure prevention technologies to assure the integrity of vehicle systems between major inspection intervals and maintain vehicle state awareness during flight. The approach provided by this ConOps is intended to help optimize technology selection and development, to allow the initial integration and demonstration of these subsystem technologies over the 5-year span of the VSST program, and to serve as a guideline for developing IVHAS technologies under the Aviation Safety Program within the next 5 to 15 years. A long-term vision of IVHAS is provided to describe a basic roadmap for more intelligent and autonomous vehicle systems.

    A HOLISTIC REDUNDANCY- AND INCENTIVE-BASED FRAMEWORK TO IMPROVE CONTENT AVAILABILITY IN PEER-TO-PEER NETWORKS

    Peer-to-Peer (P2P) technology has emerged as an important alternative to the traditional client-server communication paradigm to build large-scale distributed systems. P2P enables the creation and dissemination of, and access to, information at low cost and without the need for dedicated coordinating entities. However, existing P2P systems fail to provide high levels of content availability, which limits their applicability and adoption. This dissertation takes a holistic approach to devise mechanisms to improve content availability in large-scale P2P systems. Content availability in P2P can be impacted by hardware failures and churn. Hardware failures, in the form of disk or node failures, render information inaccessible. Churn, an inherent property of P2P, is the collective effect of the users’ uncoordinated behavior, which occurs when a large percentage of nodes join and leave frequently. Such behavior reduces content availability significantly. Mitigating the combined effect of hardware failures and churn on content availability in P2P requires new and innovative solutions that go beyond those applied in existing distributed systems. To address this challenge, the thesis proposes two complementary, low-cost mechanisms, whereby nodes self-organize to overcome failures and improve content availability. The first mechanism is a low-complexity and highly flexible hybrid redundancy scheme, referred to as Proactive Repair (PR). The second mechanism is an incentive-based scheme that promotes cooperation and enforces fair exchange of resources among peers. These mechanisms provide the basis for the development of distributed self-organizing algorithms to automate PR and, through incentives, maximize their effectiveness in realistic P2P environments. Our proposed solution is evaluated using a combination of analytical and experimental methods. The analytical models are developed to determine the availability and repair cost properties of PR. 
The results indicate that PR outperforms other redundancy schemes in repair cost. The experimental analysis was carried out using simulation and the development of a testbed. The simulation results confirm that PR improves content availability in P2P. The proposed mechanisms are implemented and tested using a DHT-based P2P application environment. The experimental results indicate that the incentive-based mechanism can promote fair exchange of resources and limit the impact of uncooperative behaviors such as “free-riding”.

    Neutral Networks of Real-World Programs and their Application to Automated Software Evolution

    The existing software development ecosystem is the product of evolutionary forces, and consequently real-world software is amenable to improvement through automated evolutionary techniques. This dissertation presents empirical evidence that software is inherently robust to small randomized program transformations, or 'mutations'. Simple and general mutation operations are demonstrated that can be applied to software source code, compiled assembler code, or directly to binary executables. These mutations often generate variants of working programs that differ significantly from the original, yet remain fully functional. Applying successive mutations to the same software program uncovers large 'neutral networks' of fully functional variants of real-world software projects. These properties of 'mutational robustness' and the corresponding 'neutral networks' have been studied extensively in biology and are believed to be related to the capacity for unsupervised evolution and adaptation. As in biological systems, mutational robustness and neutral networks in software systems enable automated evolution. The dissertation presents several applications that leverage software neutral networks to automate common software development and maintenance tasks. Neutral networks are explored to generate diverse implementations of software for improving runtime security and for proactively repairing latent bugs. Next, a technique is introduced for automatically repairing bugs in the assembler and executables compiled from off-the-shelf software. As demonstration, a proprietary executable is manipulated to patch security vulnerabilities without access to source code or any aid from the software vendor. Finally, software neutral networks are leveraged to optimize complex nonfunctional runtime properties. 
This optimization technique is used to reduce the energy consumption of the popular PARSEC benchmark applications by 20% compared to the best available public domain compiler optimizations. The applications presented herein apply evolutionary computation techniques to existing software using common software engineering tools. By enabling evolutionary techniques within the existing software development toolchain, this work is more likely to be of practical benefit to the developers and maintainers of real-world software systems.