182 research outputs found

    A self-healing framework for general software systems

    Get PDF
    Modern systems must guarantee high reliability, availability, and efficiency. Their complexity, exacerbated by the dynamic integration with other systems, the use of third- party services and the various different environments where they run, challenges development practices, tools and testing techniques. Testing cannot identify and remove all possible faults, thus faulty conditions may escape verification and validation activities and manifest themselves only after the system deployment. To cope with those failures, researchers have proposed the concept of self-healing systems. Such systems have the ability to examine their failures and to automatically take corrective actions. The idea is to create software systems that can integrate the knowledge that is needed to compensate for the effects of their imperfections. This knowledge is usually codified into the systems in the form of redundancy. Redundancy can be deliberately added into the systems as part of the design and the development process, as it occurs for many fault tolerance techniques. Although this kind of redundancy is widely applied, especially for safety- critical systems, it is however generally expensive to be used for common use software systems. We have some evidence that modern software systems are characterized by a different type of redundancy, which is not deliberately introduced but is naturally present due to the modern modular software design. We call it intrinsic redundancy. This thesis proposes a way to use the intrinsic redundancy of software systems to increase their reliability at a low cost. We first study the nature of the intrinsic redundancy to demonstrate that it actually exists. We then propose a way to express and encode such redundancy and an approach, Java Automatic Workaround, to exploit it automatically and at runtime to avoid system failures. Fundamentally, the Java Automatic Workaround approach replaces some failing operations with other alternative operations that are semantically equivalent in terms of the expected results and in the developer’s intent, but that they might have some syntactic difference that can ultimately overcome the failure. We qualitatively discuss the reasons of the presence of the intrinsic redundancy and we quantitatively study four large libraries to show that such redundancy is indeed a characteristic of modern software systems. We then develop the approach into a prototype and we evaluate it with four open source applications. Our studies show that the approach effectively exploits the intrinsic redundancy in avoiding failures automatically and at runtime

    Conceptual Modelling of Complex Network Management Systems

    Get PDF
    Society, as we know it today, is completely dependent on computer networks, Internet and distributed systems, which place at our disposal the necessary services to perform our daily tasks. Moreover, and unconsciously, all services and distributed systems require network management systems. These systems allow us to, in general, maintain, manage, configure, scale, adapt, modify, edit, protect or improve the main distributed systems. Their role is secondary and is unknown and transparent to the users. They provide the necessary support to maintain the distributed systems whose services we use every day. If we don’t consider network management systems during the development stage of main distributed systems, then there could be serious consequences or even total failures in the development of the distributed systems. It is necessary, therefore, to consider the management of the systems within the design of distributed systems and systematize their conception to minimize the impact of the management of networks within the project of distributed systems. In this paper, we present a formalization method of the conceptual modelling for design of a network management system through the use of formal modelling tools, thus allowing from the definition of processes to identify those responsible for these. Finally we will propose a use case to design a conceptual model intrusion detection system in network.This work was performed as part of the Smart University Project financed by the University of Alicante

    Conceptual Modelling of Complex Network Management Systems

    Get PDF
    Society, as we know it today, is completely dependent on computer networks, Internet and distributed systems, which place at our disposal the necessary services to perform our daily tasks. Moreover, and unconsciously, all services and distributed systems require network management systems. These systems allow us to, in general, maintain, manage, configure, scale, adapt, modify, edit, protect or improve the main distributed systems. Their role is secondary and is unknown and transparent to the users. They provide the necessary support to maintain the distributed systems whose services we use every day. If we don’t consider network management systems during the development stage of main distributed systems, then there could be serious consequences or even total failures in the development of the distributed systems. It is necessary, therefore, to consider the management of the systems within the design of distributed systems and systematize their conception to minimize the impact of the management of networks within the project of distributed systems. In this paper, we present a formalization method of the conceptual modelling for design of a network management system through the use of formal modelling tools, thus allowing from the definition of processes to identify those responsible for these. Finally we will propose a use case to design a conceptual model intrusion detection system in network.This work was performed as part of the Smart University Project financed by the University of Alicante

    Monitoring and analysis system for performance troubleshooting in data centers

    Get PDF
    It was not long ago. On Christmas Eve 2012, a war of troubleshooting began in Amazon data centers. It started at 12:24 PM, with an mistaken deletion of the state data of Amazon Elastic Load Balancing Service (ELB for short), which was not realized at that time. The mistake first led to a local issue that a small number of ELB service APIs were affected. In about six minutes, it evolved into a critical one that EC2 customers were significantly affected. One example was that Netflix, which was using hundreds of Amazon ELB services, was experiencing an extensive streaming service outage when many customers could not watch TV shows or movies on Christmas Eve. It took Amazon engineers 5 hours 42 minutes to find the root cause, the mistaken deletion, and another 15 hours and 32 minutes to fully recover the ELB service. The war ended at 8:15 AM the next day and brought the performance troubleshooting in data centers to world’s attention. As shown in this Amazon ELB case.Troubleshooting runtime performance issues is crucial in time-sensitive multi-tier cloud services because of their stringent end-to-end timing requirements, but it is also notoriously difficult and time consuming. To address the troubleshooting challenge, this dissertation proposes VScope, a flexible monitoring and analysis system for online troubleshooting in data centers. VScope provides primitive operations which data center operators can use to troubleshoot various performance issues. Each operation is essentially a series of monitoring and analysis functions executed on an overlay network. We design a novel software architecture for VScope so that the overlay networks can be generated, executed and terminated automatically, on-demand. From the troubleshooting side, we design novel anomaly detection algorithms and implement them in VScope. By running anomaly detection algorithms in VScope, data center operators are notified when performance anomalies happen. We also design a graph-based guidance approach, called VFocus, which tracks the interactions among hardware and software components in data centers. VFocus provides primitive operations by which operators can analyze the interactions to find out which components are relevant to the performance issue. VScope’s capabilities and performance are evaluated on a testbed with over 1000 virtual machines (VMs). Experimental results show that the VScope runtime negligibly perturbs system and application performance, and requires mere seconds to deploy monitoring and analytics functions on over 1000 nodes. This demonstrates VScope’s ability to support fast operation and online queries against a comprehensive set of application to system/platform level metrics, and a variety of representative analytics functions. When supporting algorithms with high computation complexity, VScope serves as a ‘thin layer’ that occupies no more than 5% of their total latency. Further, by using VFocus, VScope can locate problematic VMs that cannot be found via solely application-level monitoring, and in one of the use cases explored in the dissertation, it operates with levels of perturbation of over 400% less than what is seen for brute-force and most sampling-based approaches. We also validate VFocus with real-world data center traces. The experimental results show that VFocus has troubleshooting accuracy of 83% on average.Ph.D

    A model-based approach for automatic recovery from memory leaks in enterprise applications

    Get PDF
    Large-scale distributed computing systems such as data centers are hosted on heterogeneous and networked servers that execute in a dynamic and uncertain operating environment, caused by factors such as time-varying user workload and various failures. Therefore, achieving stringent quality-of-service goals is a challenging task, requiring a comprehensive approach to performance control, fault diagnosis, and failure recovery. This work presents a model-based approach for fault management, which integrates limited lookahead control (LLC), diagnosis, and fault-tolerance concepts that: (1) enables systems to adapt to environment variations, (2) maintains the availability and reliability of the system, (3) facilitates system recovery from failures. We focused on memory leak errors in this thesis. A characterization function is designed to detect memory leaks. Then, a LLC is applied to enable the computing system to adapt efficiently to variations in the workload, and to enable the system recover from memory leaks and maintain functionality

    Modeling Management Metrics for Monitoring Software Systems

    Get PDF
    Software systems are growing rapidly in size and complexity, and becoming more and more difficult and expensive to maintain exclusively by human operators. These systems are expected to be highly available, and failure in these systems is expensive. To meet availability and performance requirements within budget, automated and efficient approaches for systems monitoring are highly desirable. Autonomic computing is an effort in this direction, which promises systems that self-monitor, thus alleviating the burden of detailed operation oversight from human administrators. In particular, a solution is to develop automated monitoring systems that continuously collect monitoring data from target systems, analyze the data, detect errors and diagnose faults automatically. In this dissertation, we survey work based on management metrics and describe the common features of these current solutions. Based on observations of the advantages and drawbacks of these solutions, we present a general solution framework in four separate steps: metric modeling, system-health signature generation, system-state checking, and fault localization. Within our framework, we present two specific solutions for error detection and fault diagnosis in the system, one based on improved linear-regression modeling and the second based on summarizing the system state by an informationtheoretic measurement. We evaluate our monitoring solutions with fault-injection experiments in a J2EE benchmark and show the effectiveness and efficiency of our solutions

    A Concurrency and Time Centered Framework for Certification of Autonomous Space Systems

    Get PDF
    Future space missions, such as Mars Science Laboratory, suggest the engineering of some of the most complex man-rated autonomous software systems. The present process-oriented certification methodologies are becoming prohibitively expensive and do not reach the level of detail of providing guidelines for the development and validation of concurrent software. Time and concurrency are the most critical notions in an autonomous space system. In this work we present the design and implementation of the first concurrency and time centered framework for product-oriented software certification of autonomous space systems. To achieve fast and reliable concurrent interactions, we define and apply the notion of Semantically Enhanced Containers (SEC). SECs are data structures that are designed to provide the flexibility and usability of the popular ISO C++ STL containers, while at the same time they are hand-crafted to guarantee domain-specific policies, such as conformance to a given concurrency model. The application of nonblocking programming techniques is critical to the implementation of our SEC containers. Lock-free algorithms help avoid the hazards of deadlock, livelock, and priority inversion, and at the same time deliver fast and scalable performance. Practical lock-free algorithms are notoriously difficult to design and implement and pose a number of hard problems such as ABA avoidance, high complexity, portability, and meeting the linearizability correctness requirements. This dissertation presents the design of the first lock-free dynamically resizable array. Our approach o ers a set of practical, portable, lock-free, and linearizable STL vector operations and a fast and space effcient implementation when compared to the alternative lock- and STM-based techniques. Currently, the literature does not offer an explicit analysis of the ABA problem, its relation to the most commonly applied nonblocking programming techniques, and the possibilities for its detection and avoidance. Eliminating the hazards of ABA is left to the ingenuity of the software designer. We present a generic and practical solution to the fundamental ABA problem for lock-free descriptor-based designs. To enable our SEC container with the property of validating domain-specific invariants, we present Basic Query, our expression template-based library for statically extracting semantic information from C++ source code. The use of static analysis allows for a far more efficient implementation of our nonblocking containers than would have been otherwise possible when relying on the traditional run-time based techniques. Shared data in a real-time cyber-physical system can often be polymorphic (as is the case with a number of components part of the Mission Data System's Data Management Services). The use of dynamic cast is important in the design of autonomous real-time systems since the operation allows for a direct representation of the management and behavior of polymorphic data. To allow for the application of dynamic cast in mission critical code, we validate and improve a methodology for constant-time dynamic cast that shifts the complexity of the operation to the compiler's static checker. In a case study that demonstrates the applicability of the programming and validation techniques of our certification framework, we show the process of verification and semantic parallelization of the Mission Data System's (MDS) Goal Networks. MDS provides an experimental platform for testing and development of autonomous real-time flight applications
    corecore