
    Keeping checkpoint/restart viable for exascale systems

    Next-generation exascale systems, those capable of performing a quintillion operations per second, are expected to be delivered in the next 8-10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Issues such as fault tolerance and reliability will therefore limit application scalability. Current techniques for ensuring progress across faults, such as checkpoint/restart, the dominant fault tolerance mechanism of the last 25 years, incur overheads that make them increasingly problematic at the scale of future systems. In this work, we evaluate a number of techniques for decreasing the overhead of checkpoint/restart and keeping this method viable for future exascale systems. More specifically, this work evaluates state-machine replication to dramatically increase the checkpoint interval (the time between successive checkpoints) and hash-based, probabilistic incremental checkpointing using graphics processing units to decrease the checkpoint commit time (the time to save one checkpoint). Using a combination of empirical analysis, modeling, and simulation, we study the costs and benefits of these approaches across a wide range of parameters. These results, which cover a number of high-performance computing capability workloads, different failure distributions, hardware mean times to failure, and I/O bandwidths, show the potential benefits of these techniques for meeting the reliability demands of future exascale platforms.
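    The trade-offs described above can be illustrated with a small sketch. The snippet below is an illustrative approximation, not the authors' model: it uses Young's first-order formula for the optimal checkpoint interval, sqrt(2 * commit_time * MTBF), to show how raising the effective MTBF (for example via state-machine replication) lengthens the interval, and a hash-based incremental checkpoint that commits only the memory blocks whose digests have changed. All function names and parameters are assumptions made for the example.

```python
import hashlib
import math

def young_checkpoint_interval(commit_time_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval:
    sqrt(2 * delta * M), where delta is the checkpoint commit time and M is
    the system mean time between failures."""
    return math.sqrt(2.0 * commit_time_s * mtbf_s)

def incremental_checkpoint(blocks, previous_hashes):
    """Hash-based incremental checkpoint: return only the blocks whose digest
    differs from the previous checkpoint, together with the new hash table.

    blocks: list of bytes objects representing the application's memory blocks
    previous_hashes: dict mapping block index -> hex digest from the last checkpoint
    """
    dirty, new_hashes = {}, {}
    for i, block in enumerate(blocks):
        digest = hashlib.sha256(block).hexdigest()
        new_hashes[i] = digest
        if previous_hashes.get(i) != digest:
            dirty[i] = block  # only changed blocks are committed to stable storage
    return dirty, new_hashes

# Doubling the effective MTBF (e.g., via state-machine replication) lengthens
# the checkpoint interval by a factor of sqrt(2).
print(young_checkpoint_interval(commit_time_s=600, mtbf_s=4 * 3600))  # ~4157 s
print(young_checkpoint_interval(commit_time_s=600, mtbf_s=8 * 3600))  # ~5879 s
```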

    Agreement-related problems: from semi-passive replication to totally ordered broadcast

    Agreement problems constitute a fundamental class of problems in the context of distributed systems. All agreement problems follow a common pattern: all processes must agree on some common decision, the nature of which depends on the specific problem. This dissertation mainly focuses on three important agreement problems: Replication, Total Order Broadcast, and Consensus. Replication is a common means to introduce redundancy in a system in order to improve its availability. A replicated server is a server composed of multiple copies so that, if one copy fails, the other copies can still provide the service. Each copy of the server is called a replica. The replicas must all evolve in a manner that is consistent with the other replicas. Hence, updating the replicated server requires that every replica agree on the set of modifications to carry over. There are two principal replication schemes to ensure this consistency: active replication and passive replication. In Total Order Broadcast, processes broadcast messages to all processes, and all messages must be delivered in the same order. Also, if one process delivers a message m, then all correct processes must eventually deliver m. The problem of Consensus gives an abstraction for most other agreement problems. All processes initiate a Consensus by proposing a value. Then, all processes must eventually decide on the same value v, which must be one of the proposed values. These agreement problems are closely related to each other. For instance, Chandra and Toueg [CT96] show that Total Order Broadcast and Consensus are equivalent problems. In addition, Lamport [Lam78] and Schneider [Sch90] show that active replication needs Total Order Broadcast. As a result, active replication is also closely related to the Consensus problem. The first contribution of this dissertation is the definition of the semi-passive replication technique. Semi-passive replication is a passive replication scheme based on a variant of Consensus (called Lazy Consensus and also defined here). From a conceptual point of view, the result is important as it helps to clarify the relation between passive replication and the Consensus problem. In practice, it makes it possible to design systems that react more quickly to failures. The problem of Total Order Broadcast is well known in the field of distributed systems and algorithms; in fact, more than fifty algorithms have already been published on the problem. Although quite similar, these algorithms are difficult to compare, as they often differ with respect to their actual properties, assumptions, and objectives. The second main contribution of this dissertation is to define five classes of total order broadcast algorithms and to relate existing algorithms to those classes. The third contribution of this dissertation is to compare the expected performance of the various classes of total order broadcast algorithms. To achieve this goal, we define a set of metrics to predict the performance of distributed algorithms.
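    As a rough illustration of the equivalence noted above, the following sketch shows the classic reduction of Total Order Broadcast to repeated Consensus instances: undelivered messages are batched, each batch is agreed upon via a consensus primitive, and the decided batch is delivered in a deterministic order. The consensus, receive_broadcast, and deliver callables are assumptions made for the example; this is not the dissertation's semi-passive replication or Lazy Consensus algorithm.

```python
def total_order_broadcast_loop(consensus, receive_broadcast, deliver):
    """Reduction of Total Order Broadcast to repeated Consensus (sketch).

    consensus(instance_id, proposal) must return the same decided batch at
    every correct process; receive_broadcast() returns newly broadcast
    messages; deliver(m) hands a message to the application. Messages are
    assumed to be comparable values such as strings.
    """
    delivered = set()   # messages already handed to the application
    pending = set()     # broadcast but not yet ordered
    instance = 0
    while True:
        pending.update(receive_broadcast())
        proposal = sorted(pending - delivered)
        if not proposal:
            continue
        decided = consensus(instance, proposal)   # same batch decided everywhere
        for msg in sorted(decided):               # deterministic order within the batch
            if msg not in delivered:
                deliver(msg)
                delivered.add(msg)
        pending.difference_update(decided)
        instance += 1
```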

    Third NASA Langley Formal Methods Workshop

    This publication constitutes the proceedings of NASA Langley Research Center's third workshop on the application of formal methods to the design and verification of life-critical systems. This workshop brought together formal methods researchers, industry engineers, and academicians to discuss the potential of NASA-sponsored formal methods and to investigate new opportunities for applying these methods to industry problems. Contained herein are copies of the material presented at the workshop, summaries of many of the presentations, a complete list of attendees, and a detailed summary of the Langley formal methods program. Much of this material is available electronically through the World-Wide Web via the following URL.

    Defense in Depth of Resource-Constrained Devices

    The emergent next generation of computing, the so-called Internet of Things (IoT), presents significant challenges to security, privacy, and trust. The devices commonly used in IoT scenarios are often resource-constrained, with reduced computational strength, limited power budgets, and stringent availability requirements. Additionally, at least in the consumer arena, time-to-market is often prioritized at the expense of quality assurance and security. An initial lack of standards has compounded the problems arising from this rapid development. However, the explosive growth in the number and types of IoT devices has now created a multitude of competing standards and technology silos, resulting in a highly fragmented threat model. Tens of billions of these devices have been deployed in consumers' homes and industrial settings. From smart toasters and personal health monitors to industrial controls in energy delivery networks, these devices wield significant influence on our daily lives. They are privy to highly sensitive, often personal data and are responsible for real-world, security-critical, physical processes. As such, these internet-connected things are highly valuable and vulnerable targets for exploitation. Current security measures, such as reactionary policies and ad hoc patching, are not adequate at this scale. This thesis presents a multi-layered, defense-in-depth approach to preventing and mitigating a myriad of vulnerabilities associated with the above challenges. To secure the pre-boot environment, we demonstrate a hardware-based secure boot process for devices lacking secure memory. We introduce a novel implementation of remote attestation backed by blockchain technologies to address hardware and software integrity concerns for the long-running, unsupervised, and rarely patched systems found in industrial IoT settings. Moving into the software layer, we present a unique method of intraprocess memory isolation as a barrier to several prevalent classes of software vulnerabilities. Finally, we exhibit work on network analysis and intrusion detection for the low-power, low-latency, and low-bandwidth wireless networks common to IoT applications. By targeting these areas of the hardware-software stack, we seek to establish a trustworthy system that extends from power-on through application runtime.
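    As a rough, software-only illustration of the secure boot idea mentioned above (not the thesis's hardware-based design), the sketch below verifies a next-stage boot image against a keyed MAC, assuming that only a digest of the provisioning key is baked into immutable boot code; the names and the MAC scheme are assumptions made for the example.

```python
import hashlib
import hmac

# Only a digest of the provisioning key is assumed to live in immutable boot
# code, so the device does not need secure storage for the key itself.
TRUSTED_KEY_DIGEST = hashlib.sha256(b"device-provisioning-key").hexdigest()

def verify_next_stage(image: bytes, image_mac: bytes, key: bytes) -> bool:
    """Return True only if `key` matches the trusted digest and `image_mac`
    is a valid HMAC-SHA256 of the next boot stage under that key."""
    if hashlib.sha256(key).hexdigest() != TRUSTED_KEY_DIGEST:
        return False  # unknown or substituted key: refuse to boot
    expected = hmac.new(key, image, hashlib.sha256).digest()
    return hmac.compare_digest(expected, image_mac)

# The boot loader would call verify_next_stage() and jump to the image only on True.
```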

    Protocol composition frameworks and modular group communication: models, algorithms and architectures

    It is noticeable that our society is increasingly relying on computer systems. Nowadays, computer networks can be found in places where it would have been unthinkable a few decades ago, in some cases supporting critical applications on which human lives may depend. Although this growing reliance on networked systems is generally perceived as technological progress, one should bear in mind that such systems are constantly growing in size and complexity, to such an extent that assuring their correct operation is sometimes a challenging task. Hence, dependability of distributed systems has become a crucial issue and has motivated an important body of research over the last years. No matter how much effort we put into ensuring our distributed system's correctness, we will be unable to prevent crashes. Therefore, designing distributed systems to tolerate rather than prevent such crashes is a reasonable approach. This is the purpose of fault tolerance. Among all techniques that provide fault tolerance, replication is the only one that allows the system to mask process crashes. The intuition behind replication is simple: instead of having one instance of a service, we run several of them. If one of the replicas crashes, the rest can take over so that the crash does not prevent the system from delivering the expected service. A replicated service needs to keep all its replicas consistent, and group communication protocols provide abstractions to preserve such consistency. Group communication toolkits have been present since the late 80s. At the beginning they were monolithic; later on they became modular. Modular group communication toolkits are composed of a set of off-the-shelf protocol modules that can be tailored to the application's needs. Composing protocols requires setting up basic rules that define how modules are composed and how they interact. Sometimes these rules are devised exclusively for a particular protocol suite, but it is more sensible to agree on a carefully chosen set of rules and reuse them: this is the essence of protocol composition frameworks. There is a great diversity of protocol composition frameworks at present, and none is commonly considered the best. Furthermore, any attempt to defend a framework as being the best finds strong opposition, with plenty of arguments pointing out its drawbacks. Given the complexity of current group communication toolkits and their configurability requirements, we believe that research on modular group communication and protocol composition frameworks must go hand in hand. The main goal of this thesis is to advance the state of the art in these two fields jointly and to demonstrate how protocols can benefit from frameworks, as well as how frameworks can benefit from protocols. The thesis is structured in three parts. Part I focuses on issues related to protocol composition frameworks. Part II is devoted to modular group communication. Finally, Part III presents our modular group communication prototype: Fortika. Part III combines the results of the two previous parts, thereby acting as the convergence point. At the beginning of Part I, we propose four perspectives to describe and compare frameworks, on which we base our research on protocol frameworks. These perspectives are: composition model (what the composition looks like), interaction model (how the components interact), concurrency model (how concurrency is managed within the framework), and interaction with the environment (how the framework communicates with the outside world).
    We compare Appia and Cactus, two relevant protocol composition frameworks with very different designs. Overall, we cannot tell which framework is better; however, a thorough comparison using the four perspectives mentioned above showed that Appia is better in certain aspects, while Cactus is better in others. Concurrency control to avoid race conditions and deadlocks should be ensured by the protocol framework, but this is not always the case. We survey the concurrency models of eight protocol composition frameworks and propose new features to improve concurrency management. Events are the basic mechanism that protocol modules use to communicate with each other, and most protocol composition frameworks place events at the core of their interaction model. However, events are not as good as one may expect. We point out the drawbacks of events and propose an alternative interaction scheme that uses message headers instead of events: the header-driven model. Part II starts by discussing common features of traditional group communication toolkits and the problems they entail. Then, a new modular group communication architecture is presented. It is less complex, more powerful, and more responsive to failures than traditional architectures. Crash-recovery is a model where crashed processes can be restarted and continue where they were executing just before they crashed. This requires logging the state to disk periodically. We argue that current specifications of atomic broadcast (an important group communication primitive) are not satisfactory. We propose a novel specification that intends to overcome the problems we spotted in existing specifications. Additionally, we present two implementations of our atomic broadcast specification and compare their performance. Fortika is the main prototype of the thesis and the subject of Part III. Fortika is a group communication toolkit written in Java that can use third-party frameworks like Cactus or Appia for composition. Fortika was the testbed for the architectures, models, and algorithms proposed in the thesis. Finally, we performed software-based fault injection on Fortika to assess its fault tolerance. The results were valuable for improving the design of Fortika.
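    To make the header-driven interaction scheme mentioned above concrete, the sketch below composes protocol layers that push a header on send and pop it on receive, so no event dispatcher is needed between modules. The Layer and Stack classes and the sequence-number header are illustrative assumptions, not Fortika's, Appia's, or Cactus's actual API.

```python
class Layer:
    """A protocol module: pushes its header on send, pops it on receive."""
    def send(self, payload: bytes) -> bytes: ...
    def receive(self, payload: bytes) -> bytes: ...

class SequencerLayer(Layer):
    """Stamps each outgoing message with a 4-byte sequence-number header."""
    def __init__(self):
        self.next_seq = 0
    def send(self, payload):
        header = self.next_seq.to_bytes(4, "big")
        self.next_seq += 1
        return header + payload
    def receive(self, payload):
        seq = int.from_bytes(payload[:4], "big")  # a real layer would detect gaps or reorder here
        return payload[4:]

class Stack:
    """Composes layers top-to-bottom; no event dispatcher is needed."""
    def __init__(self, layers):
        self.layers = layers
    def send(self, payload):
        for layer in self.layers:             # top layer adds its header first
            payload = layer.send(payload)
        return payload
    def receive(self, payload):
        for layer in reversed(self.layers):   # strip headers in reverse order
            payload = layer.receive(payload)
        return payload

stack = Stack([SequencerLayer()])
wire = stack.send(b"hello")
assert stack.receive(wire) == b"hello"
```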

    Understanding and Identifying Vulnerabilities Related to Architectural Security Tactics

    To engineer secure software systems, software architects elicit the system's security requirements to adopt suitable architectural solutions. They often make use of architectural security tactics when designing the system's security architecture. Security tactics are reusable solutions to detect, resist, recover from, and react to attacks. Since security tactics are the building blocks of a security architecture, flaws in the adoption of these tactics, their incorrect implementation, or their deterioration during software maintenance activities can lead to vulnerabilities, which we refer to as tactical vulnerabilities. Although security tactics and their correct adoption/implementation are crucial elements for achieving security, prior works have not investigated the architectural context of vulnerabilities. Therefore, this dissertation presents a research work whose major goals are: (i) to identify common types of tactical vulnerabilities, (ii) to investigate tactical vulnerabilities through in-depth empirical studies, and (iii) to develop a technique that detects tactical vulnerabilities caused by object deserialization. First, we introduce the Common Architectural Weakness Enumeration (CAWE), a catalog that enumerates 223 tactical vulnerability types. Second, we use this catalog to conduct an empirical study using vulnerability reports from large-scale open-source systems. Among our findings, we observe that Improper Input Validation was the most recurring vulnerability type. This tactical vulnerability type is caused by not properly implementing the Validate Inputs tactic. Although prior research has focused on devising automated (or semi-automated) techniques for detecting several instances of improper input validation (e.g., SQL injection and cross-site scripting), one instance has been neglected: the untrusted deserialization of objects. Unlike other input validation problems, object deserialization vulnerabilities exhibit a set of characteristics that are hard to handle for effective vulnerability detection, and we currently lack a robust approach that can detect untrusted deserialization problems. Hence, this dissertation introduces DODO (untrusteD Object Deserialization detectOr), a novel program analysis technique to detect deserialization vulnerabilities. DODO encompasses a sound static analysis of the program to extract potentially vulnerable paths, an exploit generation engine, and a dynamic analysis engine to verify the existence of untrusted object deserialization. Our experiments showed that DODO can successfully infer possible vulnerabilities that could arise at runtime during object deserialization.
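    Purely as an illustration of the vulnerability class and of the Validate Inputs tactic applied to deserialization (not as part of the DODO tool chain, which analyzes programs rather than hardening them), the sketch below shows a restricted unpickler that rejects any class outside a small allow-list, following the pattern documented for Python's pickle module.

```python
import builtins
import io
import pickle

# Classes the deserializer is allowed to reconstruct from untrusted input.
SAFE_CLASSES = {("builtins", "dict"), ("builtins", "list"), ("builtins", "str"),
                ("builtins", "int"), ("builtins", "float")}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if (module, name) in SAFE_CLASSES:
            return getattr(builtins, name)
        # Reject anything else, e.g. callables reachable through a crafted payload.
        raise pickle.UnpicklingError(f"forbidden class {module}.{name}")

def safe_loads(data: bytes):
    """Deserialize untrusted bytes while refusing non-allow-listed classes."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```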

    Exploiting Host Availability in Distributed Systems.

    As distributed systems become more decentralized, fluctuating host availability is an increasingly disruptive phenomenon. Older systems such as AFS used a small number of well-maintained, highly available machines to coordinate access to shared client state; server uptime (and thus service availability) was expected to be high. Newer services scale to larger numbers of clients by increasing the number of servers. In these systems, the responsibility for maintaining the service abstraction is spread amongst thousands of machines. In the extreme, each client is also a server that must respond to requests from its peers, and each host can opt in or out of the system at any time. In these operating environments, a non-trivial fraction of servers will be unavailable at any given time. This diffusion of responsibility from a few dedicated hosts to many unreliable ones has a dramatic impact on distributed system design, since it is difficult to build robust applications atop a partially available, potentially untrusted substrate. This dissertation explores one aspect of this challenge: how can a distributed system measure the fluctuating availability of its constituent hosts, and how can it use an understanding of this churn to improve performance and security? This dissertation extends the previous literature in three ways. First, it introduces new analytical techniques for characterizing availability data, applying these techniques to several real networks and explaining the distinct uptime patterns found within. Second, this dissertation introduces new methods for predicting future availability, both at the granularity of individual hosts and of clusters of hosts. Third, my dissertation describes how to use these new techniques to improve the performance and security of distributed systems.
    Ph.D. Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/58445/1/jmickens_1.pd
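    As a rough illustration of the kind of availability characterization and prediction described above (a simple baseline, not the dissertation's actual techniques), the sketch below computes a host's availability fraction from an uptime trace and predicts future availability from its hour-of-week profile; the data layout and function names are assumptions made for the example.

```python
from collections import defaultdict

def availability_fraction(trace):
    """Fraction of samples in which the host was online.

    trace: iterable of (hour_index, online_bool) samples for one host."""
    samples = list(trace)
    return sum(1 for _, online in samples if online) / len(samples)

def hourly_profile(trace):
    """Estimated probability that the host is online for each hour of the week (0..167)."""
    counts, online = defaultdict(int), defaultdict(int)
    for hour_index, is_online in trace:
        slot = hour_index % 168
        counts[slot] += 1
        online[slot] += int(is_online)
    return {slot: online[slot] / counts[slot] for slot in counts}

def predict_online(profile, hour_index, threshold=0.5):
    """Naive prediction: the host will be up if it usually is at this hour of the week."""
    return profile.get(hour_index % 168, 0.0) >= threshold
```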

    Integrated helicopter survivability

    A high level of survivability is important to protect military personnel and equipment and is central to UK defence policy. Integrated Survivability is the systems engineering methodology for achieving optimum survivability at an affordable cost, enabling a mission to be completed successfully in the face of a hostile environment. “Integrated Helicopter Survivability” is an emerging discipline that applies this systems engineering approach within the helicopter domain. Philosophically, the overall survivability objective is ‘zero attrition’, even though this is unobtainable in practice. The research question was: “How can helicopter survivability be assessed in an integrated way so that the best possible level of survivability can be achieved within the constraints, and how will the associated methods support the acquisition process?” The research found that principles from safety management could be applied to the survivability problem, in particular reducing survivability risk to as low as reasonably practicable (ALARP). A survivability assessment process was developed to support this approach and was linked into the military helicopter life cycle. This process positioned the survivability assessment methods and the associated input data derivation activities. The system influence diagram method was effective at defining the problem and capturing the wider survivability interactions, including those with the defence lines of development (DLOD). Influence diagrams and Quality Function Deployment (QFD) methods were effective visual tools for eliciting stakeholder requirements and improving communication across organisational and domain boundaries. The semi-quantitative nature of the QFD method produces scores that are relative rather than physically meaningful; these results are suitable for helping to prioritise requirements early in the helicopter life cycle, but they cannot provide the quantifiable estimate of risk needed to demonstrate ALARP. The probabilistic approach implemented within the Integrated Survivability Assessment Model (ISAM) was developed to provide a quantitative estimate of ‘risk’ in support of reducing survivability risks to ALARP. Limitations in the available input data for the rate of encountering threats mean that the resulting probability of survival is not an absolute figure from which actual loss rates can be assessed. However, the method does support an assessment across platform options, provided that the ‘test environment’ remains consistent throughout the assessment. The survivability assessment process and ISAM have been applied to an acquisition programme, where they have been tested in support of survivability decision making and the design process. The survivability ‘test environment’ is an essential element of the survivability assessment process and is required by integrated survivability tools such as ISAM. This test environment, comprising threatening situations that span the complete spectrum of helicopter operations, requires further development. The ‘test environment’ would be used throughout the helicopter life cycle, from the selection of design concepts through to the test and evaluation of delivered solutions, and would be updated as part of the through life capability management (TLCM) process. A framework of survivability analysis tools that can provide probabilistic input data into ISAM and allow derivation of confidence limits requires development.
    This systems-level framework would be capable of informing more detailed survivability design work later in the life cycle and could be enabled through a MATLAB®-based approach. Survivability is an emergent system property that influences the whole system capability. There is a need for holistic, capability-level analysis tools that quantify survivability along with other influencing capabilities such as mobility (payload/range), lethality, situational awareness, sustainability and other mission capabilities. It is recommended that an investigation of capability-level analysis methods across defence be undertaken to ensure a coherent and compliant approach to systems engineering that adopts best practice from across the domains. System dynamics techniques should be considered for further use by Dstl and the wider MOD, particularly within the survivability and operational analysis domains. This would improve understanding of the problem space, promote a more holistic approach and enable a better balance of capability, within which survivability is one essential element. There would be value in considering accidental losses within a more comprehensive ‘survivability’ analysis. This approach would enable a better balance to be struck between safety and survivability risk mitigations and would lead to an improved, more integrated overall design.
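    As a rough illustration of the kind of probabilistic survivability estimate that a tool such as ISAM quantifies (the structure and numbers below are illustrative assumptions, not the ISAM model), the sketch treats lethal encounters as a Poisson process and combines an encounter rate with susceptibility and vulnerability probabilities to give a probability of surviving a sortie.

```python
import math

def p_survive_sortie(encounter_rate_per_hr: float,
                     sortie_hours: float,
                     p_hit_given_encounter: float,
                     p_kill_given_hit: float) -> float:
    """Probability of surviving one sortie, treating lethal encounters as a
    Poisson process with rate = encounter rate * P(hit|encounter) * P(kill|hit)."""
    lethal_rate = encounter_rate_per_hr * p_hit_given_encounter * p_kill_given_hit
    return math.exp(-lethal_rate * sortie_hours)

# Example: 0.2 encounters/hour over a 2-hour sortie, 30% hit probability per
# encounter and 40% kill probability per hit gives ~95% probability of survival.
print(p_survive_sortie(0.2, 2.0, 0.3, 0.4))  # ~0.953
```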