1,064,361 research outputs found
Uncovering Bugs in Distributed Storage Systems during Testing (not in Production!)
Testing distributed systems is challenging due to multiple sources of nondeterminism. Conventional testing techniques, such as unit, integration and stress testing, are ineffective in preventing serious but subtle bugs from reaching production. Formal techniques, such as TLA+, can only verify high-level specifications of systems at the level of logic-based models, and fall short of checking the actual executable code. In this paper, we present a new methodology for testing distributed systems. Our approach applies advanced systematic testing techniques to thoroughly check that the executable code adheres to its high-level specifications, which significantly improves coverage of important system behaviors. Our methodology has been applied to three distributed storage systems in the Microsoft Azure cloud computing platform. In the process, numerous bugs were identified, reproduced, confirmed and fixed. These bugs required a subtle combination of concurrency and failures, making them extremely difficult to find with conventional testing techniques. An important advantage of our approach is that a bug is uncovered in a small setting and witnessed by a full system trace, which dramatically increases the productivity of debugging
A Case Study in Testing Distributed Systems
This paper describes a case study in the testing of distributed systems. The software under test is a middleware system developed in Java. The full test lifecycle is examined including unit testing, integration testing, and system testing. Where possible, traditional tools and techniques are used to carry out the testing. One aspect where this is not possible is the testing of the low-level concurrency, which is often overlooked when testing commercial distributed systems, since the middleware or application server is already developed by a third-party and is assumed to operate correctly. This paper examines testing the middleware system itself, and therefore, a method for testing the concurrency properties of the system is used. The testing revealed a number of faults and design weaknesses, and showed that, with some adaptation, traditional tools and techniques go a long way in the testing of distributed applications
Efficient testing based on logical architecture
The rapid increase of software-intensive systems' size and complexity makes it infeasible to exhaustively run testing on the low level of source code. Instead, the testing should be executed on the high level of system architecture, i.e., at a level where component or subsystems relate and interoperate or interact collectively with the system environment. Testing at this level is system testing, including hardware and software in union. Moreover, when integrating complex, distributed systems and providing support for conformance, interoperability and interoperation tests, we need to have an explicit test description. In this vision paper, we discuss (1) how to select tests from logical architecture, especially based on the dependencies within the system, and (2) how to represent the selected tests in explicit and readable manner, so that the software systems can be cost-e!ciently maintained and evolved over their entire life-cycle. In addition, we further study the relevance between di↵erent tests, based on which, we can optimise the test suites for e!cient testing, and propose optimal resource allocation strategies for cloud-based testing
Recommended from our members
Modular and Safe Event-Driven Programming
Asynchronous event-driven systems are ubiquitous across domains such as device drivers, distributed systems, and robotics. These systems are notoriously hard to get right as the programmer needs to reason about numerous control paths resulting from the complex interleaving of events (or messages) and failures. Unsurprisingly, it is easy to introduce subtle errors while attempting to fill in gaps between high-level system specifications and their concrete implementations.This dissertation proposes new methods for programming safe event-driven asynchronous systems.In the first part of the thesis, we present ModP, a modular programming framework for compositional programming and testing of event-driven asynchronous systems.The ModP module system supports a novel theory of compositional refinement for assume-guarantee reasoning of dynamic event-driven asynchronous systems. We build a complex distributed systems software stack using ModP.Our results demonstrate that compositional reasoning can help scale model-checking (both explicit and symbolic) to large distributed systems.ModP is transforming the way asynchronous software is built at Microsoft and Amazon Web Services (AWS). Microsoft uses ModP for implementing safe device drivers and other software in the Windows kernel.AWS uses ModP for compositional model checking of complex distributed systems. While ModP simplifies analysis of such systems, the state space of industrial-scale systems remains extremely large.In the second part of this thesis, we present scalable verification and systematic testing approaches to further mitigate this state-space explosion problem.First, we introduce the concept of a delaying explorer to perform prioritized exploration of the behaviors of an asynchronous reactive program. A delaying explorer stratifies the search space using a custom strategy (tailored towards finding bugs faster), and a delay operation that allows deviation from that strategy. We show that prioritized search with a delaying explorer performs significantly better than existing approaches for finding bugs in asynchronous programs.Next, we consider the challenge of verifying time-synchronized systems; these are almost-synchronous systems as they are neither completely asynchronous nor synchronous.We introduce approximate synchrony, a sound and tunable abstraction for verification of almost-synchronous systems. We show how approximate synchrony can be used for verification of both time-synchronization protocols and applications running on top of them.Moreover, we show how approximate synchrony also provides a useful strategy to guide state-space exploration during model-checking.Using approximate synchrony and implementing it as a delaying explorer, we were able to verify the correctness of the IEEE 1588 distributed time-synchronization protocol and, in the process, uncovered a bug in the protocol that was well appreciated by the standards committee.In the final part of this thesis, we consider the challenge of programming a special class of event-driven asynchronous systems -- safe autonomous robotics systems.Our approach towards achieving assured autonomy for robotics systems consists of two parts: (1) a high-level programming language for implementing and validating the reactive robotics software stack; and (2) an integrated runtime assurance system to ensure that the assumptions used during design-time validation of the high-level software hold at runtime.Combining high-level programming language and model-checking with runtime assurance helps us bridge the gap between design-time software validation that makes assumptions about the untrusted components (e.g., low-level controllers), and the physical world, and the actual execution of the software on a real robotic platform in the physical world. We implemented our approach as DRONA, a programming framework for building safe robotics systems.We used DRONA for building a distributed mobile robotics system and deployed it on real drone platforms. Our results demonstrate that DRONA (with the runtime-assurance capabilities) enables programmers to build an autonomous robotics software stack with formal safety guarantees.To summarize, this thesis contributes new theory and tools to the areas of programming languages, verification, systematic testing, and runtime assurance for programming safe asynchronous event-driven across the domains of fault-tolerant distributed systems and safe autonomous robotics systems
Towards quality-oriented architecture: Integration in a global context
This paper introduces an architectural framework for developing systems of systems, where the development plants are geographically distributed across different countries. The focus of our ongoing work is on architectural sustainability, in the sense of cost-effective longevity and endurance, and on quality assurance from the perspectives of integration in a global context. The core of our framework are different levels of abstraction, where state-of-the-art industrial development process is extended by the level of remote virtual system representation. Each abstraction level is associated with a different level of context-dependent architecture as well as the corresponding testing approaches
Adaptive Resonance: An Emerging Neural Theory of Cognition
Adaptive resonance is a theory of cognitive information processing which has been realized as a family of neural network models. In recent years, these models have evolved to incorporate new capabilities in the cognitive, neural, computational, and technological domains. Minimal models provide a conceptual framework, for formulating questions about the nature of cognition; an architectural framework, for mapping cognitive functions to cortical regions; a semantic framework, for precisely defining terms; and a computational framework, for testing hypotheses. These systems are here exemplified by the distributed ART (dART) model, which generalizes localist ART systems to allow arbitrarily distributed code representations, while retaining basic capabilities such as stable fast learning and scalability. Since each component is placed in the context of a unified real-time system, analysis can move from the level of neural processes, including learning laws and rules of synaptic transmission, to cognitive processes, including attention and consciousness. Local design is driven by global functional constraints, with each network synthesizing a dynamic balance of opposing tendencies. The self-contained working ART and dART models can also be transferred to technology, in areas that include remote sensing, sensor fusion, and content-addressable information retrieval from large databases.Office of Naval Research and the defense Advanced Research Projects Agency (N00014-95-1-0409, N00014-1-95-0657); National Institutes of Health (20-316-4304-5
Advancements in Real-Time Simulation of Power and Energy Systems
Modern power and energy systems are characterized by the wide integration of distributed generation, storage and electric vehicles, adoption of ICT solutions, and interconnection of different energy carriers and consumer engagement, posing new challenges and creating new opportunities. Advanced testing and validation methods are needed to efficiently validate power equipment and controls in the contemporary complex environment and support the transition to a cleaner and sustainable energy system. Real-time hardware-in-the-loop (HIL) simulation has proven to be an effective method for validating and de-risking power system equipment in highly realistic, flexible, and repeatable conditions. Controller hardware-in-the-loop (CHIL) and power hardware-in-the-loop (PHIL) are the two main HIL simulation methods used in industry and academia that contribute to system-level testing enhancement by exploiting the flexibility of digital simulations in testing actual controllers and power equipment. This book addresses recent advances in real-time HIL simulation in several domains (also in new and promising areas), including technique improvements to promote its wider use. It is composed of 14 papers dealing with advances in HIL testing of power electronic converters, power system protection, modeling for real-time digital simulation, co-simulation, geographically distributed HIL, and multiphysics HIL, among other topics
Aspect-oriented fault tolerance for real-time embedded systems
Real-time embedded systems for safety-critical applications have to introduce fault tolerance mechanisms in order to cope with hardware and software errors. Fault tolerance is usually applied by means of redundancy and diversity. Redundant hardware implies the establishment of a distributed system executing a set of fault tolerance strategies by software, and may also employ some form of diversity, by using different variants or versions for the same processing.
This paper describes our approach to introduce fault tolerance in distributed embedded systems applications, using aspect-oriented programming (AOP). A real-time operating system sup-porting middleware thread communication was integrated to a fault tolerant framework. The introduction of fault tolerance in the system is performed by AOP at the application thread level. The advantages of this approach include higher modularization, less efforts for legacy systems evolution and better configurability for testing and product line development. This work has been tested and evaluated successfully in several fault tolerant configurations and presented no significant performance or memory footprint costs.Fundação para a Ciência e a Tecnologia (FCT
- …