Recommended from our members
Fault diversity among off-the-shelf SQL database servers
Fault tolerance is often the only viable way of obtaining the required system dependability from systems built out of "off-the-shelf" (OTS) products. We have studied a sample of bug reports from four off-the-shelf SQL servers so as to estimate the possible advantages of software fault tolerance - in the form of modular redundancy with diversity - in complex off-the-shelf software. We checked whether these bugs would cause coincident failures in more than one of the servers. We found that very few bugs affected two of the four servers, and none caused failures in more than two. We also found that only four of these bugs would cause identical, undetectable failures in two servers. Therefore, a fault-tolerant server, built with diverse off-the-shelf servers, seems to have a good chance of delivering improvements in availability and failure rates compared with the individual off-the-shelf servers or their replicated, nondiverse configurations
Fault tolerance via diversity for off-the-shelf products: A study with SQL database servers
If an off-the-shelf software product exhibits poor dependability due to design faults, then software fault tolerance is often the only way available to users and system integrators to alleviate the problem. Thanks to low acquisition costs, even using multiple versions of software in a parallel architecture, which is a scheme formerly reserved for few and highly critical applications, may become viable for many applications. We have studied the potential dependability gains from these solutions for off-the-shelf database servers. We based the study on the bug reports available for four off-the-shelf SQL servers plus later releases of two of them. We found that many of these faults cause systematic noncrash failures, which is a category ignored by most studies and standard implementations of fault tolerance for databases. Our observations suggest that diverse redundancy would be effective for tolerating design faults in this category of products. Only in very few cases would demands that triggered a bug in one server cause failures in another one, and there were no coincident failures in more than two of the servers. Use of different releases of the same product would also tolerate a significant fraction of the faults. We report our results and discuss their implications, the architectural options available for exploiting them, and the difficulties that they may present
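The diverse-redundancy idea in the two abstracts above can be illustrated with a small adjudicator that sends the same query to several different servers and compares their answers. This is only a minimal sketch under stated assumptions: the `adjudicate` function, the callable-per-replica interface, and the outcome labels are hypothetical and are not taken from the studies, which discuss several architectural options rather than one design.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence, Tuple

# Hypothetical interface: each replica is a callable that takes a query
# string and returns a hashable result (for example, a tuple of rows).
Replica = Callable[[str], Hashable]


def adjudicate(query: str, replicas: Sequence[Replica]) -> Tuple[str, Optional[Hashable]]:
    """Run one query on every replica and vote on the returned results."""
    answers = Counter()
    for run in replicas:
        try:
            answers[run(query)] += 1
        except Exception:
            # A crash is a self-evident failure: the replica gets no vote.
            continue
    if not answers:
        return ("all_failed", None)
    value, votes = answers.most_common(1)[0]
    # Accept a result only if a strict majority of all replicas returned it.
    if votes * 2 > len(replicas):
        return ("majority", value)
    return ("no_majority", None)
```

With three replicas, a single server that returns a wrong result without crashing is outvoted by the other two. A failure that is identical in two servers would still pass the vote, which is why the count of identical, undetectable coincident failures matters in the abstracts above.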
Intrusion Tolerance: Concepts and Design Principles. A Tutorial
In traditional dependability, fault tolerance has been the workhorse of the many solutions published over the years. Classical security-related work, on the other hand, has with few exceptions favoured intrusion prevention, or intrusion detection without systematic forms of processing the intrusion symptoms. A new approach has slowly emerged during the past decade, and gained impressive momentum recently: intrusion tolerance. The purpose of this tutorial is to explain the underlying concepts and design principles. The tutorial reviews previous results in the light of intrusion tolerance (IT), introduces the fundamental ideas behind IT, and presents recent advances in the state of the art, coming from European and US research efforts devoted to IT. The program of the tutorial will address: a review of the dependability and security background; introduction of the fundamental concepts of intrusion tolerance (IT); intrusion-aware fault models; intrusion prevention; intrusion detection; IT strategies and mechanisms; design methodologies for IT systems; examples of IT systems and protocol
Managing the Future Internet through Intelligent In-Network Substrates
The current Internet has been founded on the architectural premise of a simple network service used to interconnect relatively intelligent end systems. While this simplicity allowed it to reach an impressive scale, the predictive manner in which ISP networks are currently planned and configured through external management systems and the uniform treatment of all traffic are hampering its use as a unifying multi-service network. The future Internet will need to be more intelligent and adaptive, optimizing continuously the use of its resources and recovering from transient problems, faults and attacks without any impact on the demanding services and applications running over it. This article describes an architecture that allows intelligence to be introduced within the network to support sophisticated self-management functionality in a coordinated and controllable manner. The presented approach, based on intelligent substrates, can potentially make the Internet more adaptable, agile, sustainable, and dependable given the requirements of emerging services with highly demanding traffic and rapidly changing locations. We discuss how the proposed framework can be applied to three representative emerging scenarios: dynamic traffic engineering (load balancing across multiple paths); energy efficiency in ISP network infrastructures; and cache management in content-centric networks
Nature-inspired survivability: Prey-inspired survivability countermeasures for cloud computing security challenges
As cloud computing environments become complex, adversaries have become highly sophisticated and unpredictable. Moreover, they can easily increase attack power and persist longer before detection. Uncertain malicious actions, latent risks, and Unobserved or Unobservable risks (UUURs) characterise this new threat domain. This thesis proposes prey-inspired survivability to address unpredictable security challenges born out of UUURs. While survivability is a well-addressed phenomenon in non-extinct prey animals, applying prey survivability to cloud computing directly is challenging due to contradicting end goals. How to manage evolving survivability goals and requirements under contradicting environmental conditions adds to the challenges. To address these challenges, this thesis proposes a holistic taxonomy which integrates multiple and disparate perspectives of cloud security challenges. In addition, it proposes using TRIZ (Teoriya Resheniya Izobretatelskikh Zadach) to derive prey-inspired solutions through resolving contradictions. First, it develops a 3-step process to facilitate interdomain transfer of concepts from nature to cloud. Moreover, TRIZ’s generic approach suggests specific solutions for cloud computing survivability. Then, the thesis presents the conceptual prey-inspired cloud computing survivability framework (Pi-CCSF), built upon TRIZ-derived solutions. The framework run-time is pushed to the user-space to support evolving survivability design goals. Furthermore, a target-based decision-making technique (TBDM) is proposed to manage survivability decisions. To evaluate the prey-inspired survivability concept, a Pi-CCSF simulator is developed and implemented. Evaluation results show that escalating survivability actions improve the vitality of vulnerable and compromised virtual machines (VMs) by 5% and dramatically improve their overall survivability. Hypothesis testing conclusively supports the hypothesis that the escalation mechanisms can be applied to enhance the survivability of cloud computing systems. Numeric analysis of TBDM shows that by considering survivability preferences and attitudes (these directly impact survivability actions), the TBDM method brings unpredictable survivability information closer to decision processes. This enables efficient execution of variable escalating survivability actions, which enables the Pi-CCSF’s decision system (DS) to focus upon decisions that achieve survivability outcomes under unpredictability imposed by UUURs
Engineering Resilient Space Systems
Several distinct trends will influence space exploration missions in the next decade. Destinations are
becoming more remote and mysterious, science questions more sophisticated, and, as mission experience
accumulates, the most accessible targets are visited, advancing the knowledge frontier to more difficult,
harsh, and inaccessible environments. This leads to new challenges including: hazardous conditions that
limit mission lifetime, such as high radiation levels surrounding interesting destinations like Europa or
toxic atmospheres of planetary bodies like Venus; unconstrained environments with navigation hazards,
such as free-floating active small bodies; multielement missions required to answer more sophisticated
questions, such as Mars Sample Return (MSR); and long-range missions, such as Kuiper belt exploration,
that must survive equipment failures over the span of decades. These missions will need to be successful
without a priori knowledge of the most efficient data collection techniques for optimum science return.
Science objectives will have to be revised ‘on the fly’, with new data collection and navigation decisions
on short timescales.
Yet, even as science objectives are becoming more ambitious, several critical resources remain
unchanged. Since physics imposes insurmountable light-time delays, anticipated improvements to the
Deep Space Network (DSN) will only marginally improve the bandwidth and communications cadence to
remote spacecraft. Fiscal resources are increasingly limited, resulting in fewer flagship missions, smaller
spacecraft, and less subsystem redundancy. As missions visit more distant and formidable locations, the
job of the operations team becomes more challenging, seemingly inconsistent with the trend of shrinking
mission budgets for operations support. How can we continue to explore challenging new locations
without increasing risk or system complexity?
These challenges are present, to some degree, for the entire Decadal Survey mission portfolio, as
documented in Vision and Voyages for Planetary Science in the Decade 2013–2022 (National Research
Council, 2011), but are especially acute for the following mission examples, identified in our recently
completed KISS Engineering Resilient Space Systems (ERSS) study:
1. A Venus lander, designed to sample the atmosphere and surface of Venus, would have to perform
science operations as components and subsystems degrade and fail;
2. A Trojan asteroid tour spacecraft would spend significant time cruising to its ultimate destination
(essentially hibernating to save on operations costs), then upon arrival, would have to act as its
own surveyor, finding new objects and targets of opportunity as it approaches each asteroid,
requiring response on short notice; and
3. A MSR campaign would not only be required to perform fast reconnaissance over long distances
on the surface of Mars, interact with an unknown physical surface, and handle degradations and
faults, but would also contain multiple components (launch vehicle, cruise stage, entry and
landing vehicle, surface rover, ascent vehicle, orbiting cache, and Earth return vehicle) that
dramatically increase the need for resilience to failure across the complex system.
The concept of resilience and its relevance and application in various domains was a focus during the
study, with several definitions of resilience proposed and discussed. While there was substantial variation
in the specifics, there was a common conceptual core that emerged—adaptation in the presence of
changing circumstances. These changes were couched in various ways—anomalies, disruptions,
discoveries—but they all ultimately had to do with changes in underlying assumptions. Invalid
assumptions, whether due to unexpected changes in the environment, or an inadequate understanding of
interactions within the system, may cause unexpected or unintended system behavior. A system is
resilient if it continues to perform the intended functions in the presence of invalid assumptions.
Our study focused on areas of resilience that we felt needed additional exploration and integration,
namely system and software architectures and capabilities, and autonomy technologies. (While also an
important consideration, resilience in hardware is being addressed in multiple other venues, including
other KISS studies.) The study consisted of two workshops, separated by a seven-month focused study
period. The first workshop (Workshop #1) explored the ‘problem space’ as an organizing theme, and the
second workshop (Workshop #2) explored the ‘solution space’. In each workshop, focused discussions
and exercises were interspersed with presentations from participants and invited speakers.
The study period between the two workshops was organized as part of the synthesis activity during the
first workshop. The study participants, after spending the initial days of the first workshop discussing the
nature of resilience and its impact on future science missions, decided to split into three focus groups,
each with a particular thrust, to explore specific ideas further and develop material needed for the second
workshop. The three focus groups and areas of exploration were:
1. Reference missions: address/refine the resilience needs by exploring a set of reference missions
2. Capability survey: collect, document, and assess current efforts to develop capabilities and
technology that could be used to address the documented needs, both inside and outside NASA
3. Architecture: analyze the impact of architecture on system resilience, and provide principles and
guidance for architecting greater resilience in our future systems
The key product of the second workshop was a set of capability roadmaps pertaining to the three
reference missions selected for their representative coverage of the types of space missions envisioned for
the future. From these three roadmaps, we have extracted several common capability patterns that would
be appropriate targets for near-term technical development: one focused on graceful degradation of
system functionality, a second focused on data understanding for science and engineering applications,
and a third focused on hazard avoidance and environmental uncertainty. Continuing work is extending
these roadmaps to identify candidate enablers of the capabilities from the following three categories:
architecture solutions, technology solutions, and process solutions.
The KISS study allowed a collection of diverse and engaged engineers, researchers, and scientists to think
deeply about the theory, approaches, and technical issues involved in developing and applying resilience
capabilities. The conclusions summarize the varied and disparate discussions that occurred during the
study, and include new insights about the nature of the challenge and potential solutions:
1. There is a clear and definitive need for more resilient space systems. During our study period,
the key scientists/engineers we engaged to understand potential future missions confirmed the
scientific and risk reduction value of greater resilience in the systems used to perform these
missions.
2. Resilience can be quantified in measurable terms—project cost, mission risk, and quality of
science return. In order to consider resilience properly in the set of engineering trades performed
during the design, integration, and operation of space systems, the benefits and costs of resilience
need to be quantified. We believe, based on the work done during the study, that appropriate
metrics to measure resilience must relate to risk, cost, and science quality/opportunity. Additional
work is required to explicitly tie design decisions to these first-order concerns.
3. There are many existing basic technologies that can be applied to engineering resilient space
systems. Through the discussions during the study, we found many varied approaches and
research that address the various facets of resilience, some within NASA, and many more
beyond. Examples from civil architecture, Department of Defense (DoD) / Defense Advanced
Research Projects Agency (DARPA) initiatives, ‘smart’ power grid control, cyber-physical
systems, software architecture, and application of formal verification methods for software were
identified and discussed. The variety and scope of related efforts is encouraging and presents
many opportunities for collaboration and development, and we expect many collaborative
proposals and joint research as a result of the study.
4. Use of principled architectural approaches is key to managing complexity and integrating
disparate technologies. The main challenge inherent in considering highly resilient space
systems is that the increase in capability can result in an increase in complexity with all of the
risks and costs associated with more complex systems. What is needed is a better way of
conceiving space systems that enables incorporation of capabilities without increasing
complexity. We believe principled architecting approaches provide the needed means to convey a
unified understanding of the system to primary stakeholders, thereby controlling complexity in
the conception and development of resilient systems, and enabling the integration of disparate
approaches and technologies. A representative architectural example is included in Appendix F.
5. Developing trusted resilience capabilities will require a diverse yet strategically directed
research program. Despite the interest in, and benefits of, deploying resilient space systems, to
date, there has been a notable lack of meaningful demonstrated progress in systems capable of
working in hazardous uncertain situations. The roadmaps completed during the study, and
documented in this report, provide the basis for a real funded plan that considers the required
fundamental work and evolution of needed capabilities.
Exploring space is a challenging and difficult endeavor. Future space missions will require more
resilience in order to perform the desired science in new environments under constraints of development
and operations cost, acceptable risk, and communications delays. Development of space systems with
resilient capabilities has the potential to expand the limits of possibility, revolutionizing space science by
enabling as yet unforeseen missions and breakthrough science observations.
Our KISS study provided an essential venue for the consideration of these challenges and goals.
Additional work and future steps are needed to realize the potential of resilient systems—this study
provided the necessary catalyst to begin this process
Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing
Large applications executing on Grid or cluster architectures consisting of hundreds or thousands of computational nodes create problems with respect to reliability. The sources of the problems are node failures and the need for dynamic configuration over extensive runtime. This paper presents two fault-tolerance mechanisms called Theft-Induced Checkpointing and Systematic Event Logging. These are transparent protocols capable of overcoming problems associated with both benign faults, i.e., crash faults, and node or subnet volatility. Specifically, the protocols base the state of the execution on a dataflow graph, allowing for efficient recovery in dynamic heterogeneous systems as well as multithreaded applications. By allowing recovery even under different numbers of processors, the approaches are especially suitable for applications with a need for adaptive or reactionary configuration control. The low-cost protocols offer the capability of controlling or bounding the overhead. A formal cost model is presented, followed by an experimental evaluation. It is shown that the overhead of the protocol is very small, and the maximum work lost by a crashed process is small and bounded. Index Terms: Grid computing, rollback recovery, checkpointing, event logging.
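The claim that the work lost by a crashed process is bounded can be illustrated with plain periodic checkpointing. The sketch below is not Theft-Induced Checkpointing or Systematic Event Logging, whose details the abstract does not give. It is a generic rollback-recovery example, and the `CheckpointedTask` class and its parameters are hypothetical.

```python
import copy


class CheckpointedTask:
    """Generic periodic checkpointing (an illustration, not the paper's protocol)."""

    def __init__(self, state, interval):
        self.state = state
        self.interval = interval          # take a checkpoint every `interval` steps
        self.steps_done = 0
        self._saved = (0, copy.deepcopy(state))

    def step(self, update):
        """Apply one unit of work, then checkpoint if the interval is reached."""
        update(self.state)
        self.steps_done += 1
        if self.steps_done % self.interval == 0:
            self._saved = (self.steps_done, copy.deepcopy(self.state))

    def rollback(self):
        """Restore the last checkpoint and return the number of steps lost."""
        saved_steps, snapshot = self._saved
        lost = self.steps_done - saved_steps
        self.steps_done = saved_steps
        self.state = copy.deepcopy(snapshot)
        return lost
```

Because a checkpoint is taken every `interval` completed steps, a rollback in this sketch never discards more than `interval - 1` completed steps, so choosing the interval trades checkpoint overhead against the bound on lost work.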
Retrofitting Autonomic Capabilities onto Legacy Systems
Autonomic computing - self-configuring, self-healing, self-optimizing applications, systems and networks - is a promising solution to ever-increasing system complexity and the spiraling costs of human management as systems scale to global proportions. Most results to date, however, suggest ways to architect new software constructed from the ground up as autonomic systems, whereas in the real world organizations continue to use stovepipe legacy systems and/or build 'systems of systems' that draw from a gamut of disparate technologies from numerous vendors. Our goal is to retrofit autonomic computing onto such systems, externally, without any need to understand, modify or even recompile the target system's code. We present an autonomic infrastructure that operates similarly to active middleware, to explicitly add autonomic services to pre-existing systems via continual monitoring and a feedback loop that performs, as needed, reconfiguration and/or repair. Our lightweight design and separation of concerns enables easy adoption of individual components, independent of the rest of the full infrastructure, for use with a large variety of target systems. This work has been validated by several case studies spanning multiple application domains
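The external monitoring and feedback loop described above can be sketched as a small control loop that only observes the target through a probe and acts on it through a repair hook, never touching its code. All names here (`control_loop`, `probe`, `is_healthy`, `repair`) are hypothetical placeholders, not the interfaces of the infrastructure in the abstract.

```python
from typing import Callable, TypeVar

Reading = TypeVar("Reading")


def control_loop(
    probe: Callable[[], Reading],
    is_healthy: Callable[[Reading], bool],
    repair: Callable[[Reading], None],
    max_cycles: int,
) -> int:
    """Monitor a target from the outside and repair it when a reading is unhealthy.

    Returns the number of repair actions taken.
    """
    repairs = 0
    for _ in range(max_cycles):
        reading = probe()            # observe the legacy system externally
        if not is_healthy(reading):  # analyse the observation
            repair(reading)          # reconfigure or repair through an external hook
            repairs += 1
    return repairs
```

Keeping the probe, the health check, and the repair action as separate callables mirrors the separation of concerns mentioned in the abstract: each part can be swapped for a different target system without changing the loop.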