
    Keeping checkpoint/restart viable for exascale systems

    Next-generation exascale systems, those capable of performing a quintillion operations per second, are expected to be delivered in the next 8-10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Therefore, issues such as fault tolerance and reliability will limit application scalability. Current techniques to ensure progress across faults, such as checkpoint/restart (the dominant fault tolerance mechanism for the last 25 years), are increasingly problematic at the scales of future systems due to their excessive overheads. In this work, we evaluate a number of techniques to decrease the overhead of checkpoint/restart and keep this method viable for future exascale systems. More specifically, this work evaluates state-machine replication to dramatically increase the checkpoint interval (the time between successive checkpoints) and hash-based, probabilistic incremental checkpointing using graphics processing units to decrease the checkpoint commit time (the time to save one checkpoint). Using a combination of empirical analysis, modeling, and simulation, we study the costs and benefits of these approaches across a wide range of parameters. These results, which cover a number of high-performance computing capability workloads, different failure distributions, hardware mean times to failure, and I/O bandwidths, show the potential benefits of these techniques for meeting the reliability demands of future exascale platforms.
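    The hash-based incremental checkpointing idea lends itself to a short illustration: hash each memory region and commit to stable storage only the regions whose hash has changed since the previous checkpoint, accepting a small, quantifiable chance that a hash collision hides a change. The Python sketch below is a minimal illustration under assumed names (incremental_checkpoint, pages, storage); the work described above computes the hashes on graphics processing units, which this sketch does not attempt to model.

```python
import hashlib

def incremental_checkpoint(pages, previous_hashes, storage):
    """Hash every memory region and write to stable storage only those whose
    hash has changed since the last checkpoint; returns the new hash table."""
    new_hashes = {}
    for page_id, data in pages.items():
        digest = hashlib.sha256(data).hexdigest()
        new_hashes[page_id] = digest
        if previous_hashes.get(page_id) != digest:
            storage[page_id] = bytes(data)   # commit only changed regions
    return new_hashes
```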

    Third NASA Langley Formal Methods Workshop

    This publication constitutes the proceedings of NASA Langley Research Center's third workshop on the application of formal methods to the design and verification of life-critical systems. This workshop brought together formal methods researchers, industry engineers, and academicians to discuss the potential of NASA-sponsored formal methods and to investigate new opportunities for applying these methods to industry problems. Contained herein are copies of the material presented at the workshop, summaries of many of the presentations, a complete list of attendees, and a detailed summary of the Langley formal methods program. Much of this material is available electronically through the World-Wide Web via the following URL

    Agreement-related problems: from semi-passive replication to totally ordered broadcast

    Agreement problems constitute a fundamental class of problems in the context of distributed systems. All agreement problems follow a common pattern: all processes must agree on some common decision, the nature of which depends on the specific problem. This dissertation mainly focuses on three important agreement problems: Replication, Total Order Broadcast, and Consensus. Replication is a common means to introduce redundancy in a system, in order to improve its availability. A replicated server is a server that is composed of multiple copies so that, if one copy fails, the other copies can still provide the service. Each copy of the server is called a replica. The replicas must all evolve in a manner that is consistent with one another. Hence, updating the replicated server requires that every replica agree on the set of modifications to carry over. There are two principal replication schemes to ensure this consistency: active replication and passive replication. In Total Order Broadcast, processes broadcast messages to all processes; however, all messages must be delivered in the same order. Also, if one process delivers a message m, then all correct processes must eventually deliver m. The problem of Consensus gives an abstraction to most other agreement problems. All processes initiate a Consensus by proposing a value. Then, all processes must eventually decide the same value v, which must be one of the proposed values. These agreement problems are closely related to each other. For instance, Chandra and Toueg [CT96] show that Total Order Broadcast and Consensus are equivalent problems. In addition, Lamport [Lam78] and Schneider [Sch90] show that active replication needs Total Order Broadcast. As a result, active replication is also closely related to the Consensus problem. The first contribution of this dissertation is the definition of the semi-passive replication technique. Semi-passive replication is a passive replication scheme based on a variant of Consensus (called Lazy Consensus and also defined here). From a conceptual point of view, the result is important as it helps to clarify the relation between passive replication and the Consensus problem. In practice, this makes it possible to design systems that react more quickly to failures. The problem of Total Order Broadcast is well known in the field of distributed systems and algorithms; in fact, more than fifty algorithms have already been published on the problem. Although quite similar, these algorithms are difficult to compare, as they often differ with respect to their actual properties, assumptions, and objectives. The second main contribution of this dissertation is to define five classes of total order broadcast algorithms and to relate existing algorithms to those classes. The third contribution of this dissertation is to compare the expected performance of the various classes of total order broadcast algorithms. To achieve this goal, we define a set of metrics to predict the performance of distributed algorithms.
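    To make the Total Order Broadcast requirement concrete, the sketch below shows a fixed-sequencer scheme, one classic approach to the problem: every message is routed through a single sequencer that assigns global sequence numbers, and receivers deliver strictly in that order. This is an illustrative Python sketch with hypothetical names (Sequencer, deliver_in_order); it ignores failures entirely and is not one of the algorithms analysed in the dissertation.

```python
import itertools
from queue import Queue

class Sequencer:
    """Fixed-sequencer total order broadcast: a single process assigns a
    global sequence number to every message and relays it to all receivers."""
    def __init__(self, receiver_queues):
        self.counter = itertools.count()
        self.receivers = receiver_queues

    def broadcast(self, msg):
        seq = next(self.counter)
        for q in self.receivers:
            q.put((seq, msg))            # every process sees the same (seq, msg)

def deliver_in_order(q: Queue, expected: int = 0):
    """Deliver in sequence-number order, buffering out-of-order arrivals."""
    pending, delivered = {}, []
    while not q.empty():
        seq, msg = q.get()
        pending[seq] = msg
        while expected in pending:       # deliver the next expected message
            delivered.append(pending.pop(expected))
            expected += 1
    return delivered, expected
```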

    Defense in Depth of Resource-Constrained Devices

    The emergent next generation of computing, the so-called Internet of Things (IoT), presents significant challenges to security, privacy, and trust. The devices commonly used in IoT scenarios are often resource-constrained, with reduced computational strength, tight limits on power consumption, and stringent availability requirements. Additionally, at least in the consumer arena, time-to-market is often prioritized at the expense of quality assurance and security. An initial lack of standards has compounded the problems arising from this rapid development. However, the explosive growth in the number and types of IoT devices has now created a multitude of competing standards and technology silos, resulting in a highly fragmented threat model. Tens of billions of these devices have been deployed in consumers' homes and industrial settings. From smart toasters and personal health monitors to industrial controls in energy delivery networks, these devices wield significant influence on our daily lives. They are privy to highly sensitive, often personal data and responsible for real-world, security-critical, physical processes. As such, these internet-connected things are highly valuable and vulnerable targets for exploitation. Current security measures, such as reactionary policies and ad hoc patching, are not adequate at this scale. This thesis presents a multi-layered, defense-in-depth approach to preventing and mitigating a myriad of vulnerabilities associated with the above challenges. To secure the pre-boot environment, we demonstrate a hardware-based secure boot process for devices lacking secure memory. We introduce a novel implementation of remote attestation backed by blockchain technologies to address hardware and software integrity concerns for the long-running, unsupervised, and rarely patched systems found in industrial IoT settings. Moving into the software layer, we present a unique method of intraprocess memory isolation as a barrier to several prevalent classes of software vulnerabilities. Finally, we exhibit work on network analysis and intrusion detection for the low-power, low-latency, and low-bandwidth wireless networks common to IoT applications. By targeting these areas of the hardware-software stack, we seek to establish a trustworthy system that extends from power-on through application runtime.
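    As a flavour of the boot-time layer, the sketch below illustrates the general idea behind a measured or secure boot check: each boot stage is hashed and compared against a reference digest assumed to live in tamper-resistant storage before control would be handed to it. This is a minimal Python illustration with hypothetical names (verify_chain, expected_digests); the thesis targets devices that lack secure memory, so its actual mechanism is hardware-based and differs from this software-only sketch.

```python
import hashlib

def verify_chain(stages, expected_digests):
    """Walk the boot chain in order; each stage's image is measured with
    SHA-256 and compared against a reference digest (assumed to be held in
    tamper-resistant storage) before control would be handed to it."""
    for name, image in stages:
        digest = hashlib.sha256(image).hexdigest()
        if digest != expected_digests.get(name):
            raise RuntimeError(f"boot halted: integrity check failed for {name}")
    return True
```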

    Protocol composition frameworks and modular group communication: models, algorithms and architectures

    It is noticeable that our society is increasingly relying on computer systems. Nowadays, computer networks can be found in places where it would have been unthinkable a few decades ago, supporting in some cases critical applications on which human lives may depend. Although this growing reliance on networked systems is generally perceived as technological progress, one should bear in mind that such systems are constantly growing in size and complexity, to such an extent that assuring their correct operation is sometimes a challenging task. Hence, dependability of distributed systems has become a crucial issue and has motivated an important body of research over recent years. No matter how much effort we put into ensuring our distributed system's correctness, we will be unable to prevent crashes. Therefore, designing distributed systems to tolerate rather than prevent such crashes is a reasonable approach. This is the purpose of fault tolerance. Among all techniques that provide fault tolerance, replication is the only one that allows the system to mask process crashes. The intuition behind replication is simple: instead of having one instance of a service, we run several of them. If one of the replicas crashes, the rest can take over so that the crash does not prevent the system from delivering the expected service. A replicated service needs to keep all its replicas consistent, and group communication protocols provide abstractions to preserve such consistency. Group communication toolkits have been present since the late 1980s. At the beginning they were monolithic; later on they became modular. Modular group communication toolkits are composed of a set of off-the-shelf protocol modules that can be tailored to the application's needs. Composing protocols requires setting up basic rules that define how modules are composed and how they interact. Sometimes these rules are devised exclusively for a particular protocol suite, but it is more sensible to agree on a carefully chosen set of rules and reuse them: this is the essence of protocol composition frameworks. There is a great diversity of protocol composition frameworks at present, and none is commonly considered the best. Furthermore, any attempt to defend one framework as the best meets strong opposition, with plenty of arguments pointing out its drawbacks. Given the complexity of current group communication toolkits and their configurability requirements, we believe that research on modular group communication and protocol composition frameworks must go hand in hand. The main goal of this thesis is to advance the state of the art in these two fields jointly and to demonstrate how protocols can benefit from frameworks and how frameworks can benefit from protocols. The thesis is structured in three parts. Part I focuses on issues related to protocol composition frameworks. Part II is devoted to modular group communication. Finally, Part III presents our modular group communication prototype, Fortika. Part III combines the results of the two previous parts, thereby acting as the convergence point. At the beginning of Part I, we propose four perspectives for describing and comparing frameworks, on which we base our research on protocol frameworks. These perspectives are: composition model (what the composition looks like), interaction model (how the components interact), concurrency model (how concurrency is managed within the framework), and interaction with the environment (how the framework communicates with the outside world).
We compare Appia and Cactus, two relevant protocol composition frameworks with very different designs. Overall, we cannot tell which framework is better; however, a thorough comparison using the four perspectives mentioned above showed that Appia is better in certain aspects, while Cactus is better in others. Concurrency control to avoid race conditions and deadlocks should be ensured by the protocol framework; however, this is not always the case. We survey the concurrency model of eight protocol composition frameworks and propose new features to improve concurrency management. Events are the basic mechanism that protocol modules use to communicate with each other, and most protocol composition frameworks include events at the core of their interaction model. However, events are seemingly not as good as one might expect. We point out the drawbacks of events and propose an alternative interaction scheme that uses message headers instead of events: the header-driven model. Part II starts by discussing common features of traditional group communication toolkits and the problems they entail. Then, a new modular group communication architecture is presented. It is less complex, more powerful, and more responsive to failures than traditional architectures. Crash-recovery is a model where crashed processes can be restarted and continue where they were executing just before they crashed; this requires logging the state to disk periodically. We argue that current specifications of atomic broadcast (an important group communication primitive) are not satisfactory. We propose a novel specification that aims to overcome the problems we identified in existing specifications. Additionally, we provide two implementations of our atomic broadcast specification and compare their performance. Fortika is the main prototype of the thesis and the subject of Part III. Fortika is a group communication toolkit written in Java that can use third-party frameworks like Cactus or Appia for composition. Fortika was the testbed for the architectures, models, and algorithms proposed in the thesis. Finally, we performed software-based fault injection on Fortika to assess its fault tolerance. The results were valuable in improving the design of Fortika.
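    The header-driven interaction model mentioned above can be pictured as follows: instead of exchanging events, each module prepends its own header on the way down the stack and consumes only that header on the way up. The Python sketch below is purely illustrative, with hypothetical names (Module, Stack); it is not the API of Appia, Cactus, or Fortika.

```python
class Module:
    """One protocol module in a hypothetical header-driven composition: on
    send it appends its own header; on receive it strips and interprets only
    the header it owns (no global event dispatching)."""
    def __init__(self, name):
        self.name = name

    def on_send(self, headers, payload):
        return headers + [(self.name, {})], payload

    def on_receive(self, headers, payload):
        assert headers and headers[-1][0] == self.name, "unexpected header"
        return headers[:-1], payload


class Stack:
    """Composes modules (ordered top to bottom) into a protocol stack."""
    def __init__(self, modules):
        self.modules = modules

    def send(self, payload):
        headers = []
        for module in self.modules:              # traverse top-down
            headers, payload = module.on_send(headers, payload)
        return headers, payload

    def receive(self, headers, payload):
        for module in reversed(self.modules):    # traverse bottom-up
            headers, payload = module.on_receive(headers, payload)
        return payload


# Example: a message produced by one stack can be consumed by a peer stack
# composed of the same modules.
stack = Stack([Module("total_order"), Module("reliable"), Module("transport")])
hdrs, data = stack.send(b"update")
assert stack.receive(hdrs, data) == b"update"
```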

    Abstracts on Radio Direction Finding (1899 - 1995)

    The files on this record represent the various databases that originally composed the CD-ROM issue of the "Abstracts on Radio Direction Finding" database, which is now part of the Dudley Knox Library's Abstracts and Selected Full Text Documents on Radio Direction Finding (1899 - 1995) Collection. (See Calhoun record https://calhoun.nps.edu/handle/10945/57364 for further information on this collection and the bibliography.) Due to issues of technological obsolescence preventing current and future audiences from accessing the bibliography, DKL exported and converted the various databases contained in the CD-ROM into the three files on this record. The contents of these files are: 1) RDFA_CompleteBibliography_xls.zip [RDFA_CompleteBibliography.xls: Metadata for the complete bibliography, in Excel 97-2003 Workbook format; RDFA_Glossary.xls: Glossary of terms, in Excel 97-2003 Workbook format; RDFA_Biographies.xls: Biographies of leading figures, in Excel 97-2003 Workbook format]; 2) RDFA_CompleteBibliography_csv.zip [RDFA_CompleteBibliography.TXT: Metadata for the complete bibliography, in CSV format; RDFA_Glossary.TXT: Glossary of terms, in CSV format; RDFA_Biographies.TXT: Biographies of leading figures, in CSV format]; 3) RDFA_CompleteBibliography.pdf: A human-readable display of the bibliographic data, as a means of double-checking any possible deviations due to conversion.

    Integrated helicopter survivability

    A high level of survivability is important to protect military personnel and equipment and is central to UK defence policy. Integrated Survivability is the systems engineering methodology to achieve optimum survivability at an affordable cost, enabling a mission to be completed successfully in the face of a hostile environment. “Integrated Helicopter Survivability” is an emerging discipline that applies this systems engineering approach within the helicopter domain. Philosophically, the overall survivability objective is ‘zero attrition’, even though this is unobtainable in practice. The research question was: “How can helicopter survivability be assessed in an integrated way so that the best possible level of survivability can be achieved within the constraints and how will the associated methods support the acquisition process?” The research found that principles from safety management could be applied to the survivability problem, in particular reducing survivability risk to as low as reasonably practicable (ALARP). A survivability assessment process was developed to support this approach and was linked into the military helicopter life cycle. This process positioned the survivability assessment methods and associated input data derivation activities. The system influence diagram method was effective at defining the problem and capturing the wider survivability interactions, including those with the defence lines of development (DLOD). Influence diagrams and Quality Function Deployment (QFD) methods were effective visual tools for eliciting stakeholder requirements and improving communication across organisational and domain boundaries. The semi-quantitative nature of the QFD method leads to numbers that are not real measures. These results are suitable for helping to prioritise requirements early in the helicopter life cycle, but they cannot provide the quantifiable estimate of risk needed to demonstrate ALARP. The probabilistic approach implemented within the Integrated Survivability Assessment Model (ISAM) was developed to provide a quantitative estimate of ‘risk’ to support the approach of reducing survivability risks to ALARP. Limitations in the available input data for the rate of encountering threats lead to a probability of survival that cannot be used as a real measure of actual loss rates. However, the method does support an assessment across platform options, provided that the ‘test environment’ remains consistent throughout the assessment. The survivability assessment process and ISAM have been applied to an acquisition programme, where they have been tested to support the survivability decision-making and design process. The survivability ‘test environment’ is an essential element of the survivability assessment process and is required by integrated survivability tools such as ISAM. This test environment, comprising threatening situations that span the complete spectrum of helicopter operations, requires further development. The ‘test environment’ would be used throughout the helicopter life cycle, from selection of design concepts through to test and evaluation of delivered solutions, and would be updated as part of the through life capability management (TLCM) process. A framework of survivability analysis tools that can provide probabilistic input data into ISAM and allow derivation of confidence limits requires development.
This systems-level framework would be capable of informing more detailed survivability design work later in the life cycle and could be enabled through a MATLAB® based approach. Survivability is an emerging system property that influences the whole system capability. There is a need for holistic capability-level analysis tools that quantify survivability along with other influencing capabilities such as mobility (payload / range), lethality, situational awareness, sustainability, and other mission capabilities. It is recommended that an investigation of capability-level analysis methods across defence be undertaken to ensure a coherent and compliant approach to systems engineering that adopts best practice from across the domains. System dynamics techniques should be considered for further use by Dstl and the wider MOD, particularly within the survivability and operational analysis domains. This would improve understanding of the problem space, promote a more holistic approach, and enable a better balance of capability, within which survivability is one essential element. There would be value in considering accidental losses within a more comprehensive ‘survivability’ analysis. This approach would enable a better balance to be struck between safety and survivability risk mitigations and would lead to an improved, more integrated overall design.
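    As background to the probabilistic approach described above, survivability assessments are often built on a textbook-style decomposition of the probability of surviving an engagement into susceptibility and vulnerability terms. The expression below is that generic decomposition, offered only as illustrative context and not as ISAM's internal formulation.

```latex
P_{\text{survive}}
  = 1 - P_{\text{kill}}
  = 1 - \underbrace{P_{\text{encounter}}\; P_{\text{hit}\mid\text{encounter}}}_{\text{susceptibility}}
        \;\underbrace{P_{\text{kill}\mid\text{hit}}}_{\text{vulnerability}}
```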

    Mining a Small Medical Data Set by Integrating the Decision Tree and t-test

    Although several researchers have used statistical methods to prove that aspiration followed by the injection of 95% ethanol left in situ (retention) is an effective treatment for ovarian endometriomas, very few discuss the different conditions that could generate different recovery rates for the patients. Therefore, this study adopts the statistical method and decision tree techniques together to analyze the postoperative status of ovarian endometriosis patients under different conditions. Since our collected data set is small, containing only 212 records, we use all of these data as the training data. Therefore, instead of using the resultant tree to generate rules directly, we first use the value of each node as a cut point to generate all possible rules from the tree. Then, using the t-test, we verify the rules to discover some useful description rules after all possible rules from the tree have been generated. Experimental results show that our approach can find some new and interesting knowledge about recurrent ovarian endometriomas under different conditions.
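    The combined procedure can be pictured with a short sketch: fit a decision tree on the full data set, treat every internal node's split value as a candidate cut point, and keep only the cut points for which a two-sample t-test shows a significant difference in outcome between the two sides of the split. The Python sketch below is an interpretation under assumed names (candidate_cut_points, verify_rules) and assumes NumPy arrays with a continuous recovery measure as the outcome; the paper's exact rule-generation and verification steps may differ.

```python
from scipy.stats import ttest_ind
from sklearn.tree import DecisionTreeClassifier

def candidate_cut_points(X, y):
    """Fit a decision tree on the full (small) data set and collect every
    internal node's split as a candidate (feature_index, threshold) rule."""
    tree = DecisionTreeClassifier(random_state=0).fit(X, y).tree_
    # leaves have feature == -2, so keep internal nodes only
    return [(f, t) for f, t in zip(tree.feature, tree.threshold) if f >= 0]

def verify_rules(X, outcome, cut_points, alpha=0.05):
    """Keep only the cut points for which the outcome differs significantly
    (Welch two-sample t-test) between the two sides of the split."""
    kept = []
    for feature, threshold in cut_points:
        left = outcome[X[:, feature] <= threshold]
        right = outcome[X[:, feature] > threshold]
        if len(left) > 1 and len(right) > 1:
            _, p_value = ttest_ind(left, right, equal_var=False)
            if p_value < alpha:
                kept.append((feature, threshold, p_value))
    return kept
```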

    Discrete Event Simulations

    Considered by many authors as a technique for modelling stochastic, dynamic, and discretely evolving systems, discrete event simulation (DES) has gained widespread acceptance among practitioners who want to represent and improve complex systems. Since DES is applied in widely different areas, this book reflects many different points of view about it: each author describes how DES is understood and applied within their own context of work, providing an extensive understanding of what DES is. It can be said that the name of the book itself reflects the plurality that these points of view represent. The book embraces a number of topics covering theory, methods, and applications to a wide range of sectors and problem areas, which have been categorised into five groups. Beyond this variety of viewpoints on DES, one additional thing stands out about this book: its richness in actual data and in analyses based on actual data. Whereas many academic areas lack application cases, roughly half of the chapters included in this book deal with actual problems or are at least based on actual data. Thus, the editor firmly believes that this book will be interesting for both beginners and practitioners in the area of DES.
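    For readers new to the technique, the core of a discrete event simulation is a small loop over a time-ordered event queue: the clock jumps from one event to the next, and each event handler may schedule further events. The Python sketch below is a minimal, generic illustration (names such as simulate and arrival are hypothetical); it is not taken from any chapter of the book.

```python
import heapq
import itertools

def simulate(initial_events, horizon=float("inf")):
    """Minimal discrete event simulation loop: events are (time, handler)
    pairs kept in a priority queue; the clock jumps from one event time to
    the next, and each handler may schedule new events by returning further
    (time, handler) pairs."""
    tie = itertools.count()                       # breaks ties between equal times
    queue = [(t, next(tie), h) for t, h in initial_events]
    heapq.heapify(queue)
    clock = 0.0
    while queue:
        clock, _, handler = heapq.heappop(queue)
        if clock > horizon:
            break
        for t, h in handler(clock) or []:
            heapq.heappush(queue, (t, next(tie), h))
    return clock

# Example handler: a simple arrival process with a fixed interarrival time
# that reschedules itself until the horizon is reached.
def arrival(now, interarrival=1.0):
    print(f"arrival at t={now:.2f}")
    return [(now + interarrival, arrival)]

simulate([(0.0, arrival)], horizon=5.0)
```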