1,383 research outputs found

    A small-scale testbed for large-scale reliable computing

    Get PDF
    High performance computing (HPC) systems frequently suffer errors and failures from hardware components that negatively impact the performance of jobs run on these systems. We analyzed system logs from two HPC systems at Purdue University and created statistical models for memory and hard disk errors. We created a small-scale error injection testbed—using a customized QEMU build, libvirt, and Python—for HPC application programmers to test and debug their programs in a faulty environment so that programmers can write more robust and resilient programs before deploying them on an actual HPC system. The deliverables for this project are the fault injection program, the modified QEMU source code, and the statistical models used for driving the injection

    Proactive software rejuvenation solution for web enviroments on virtualized platforms

    Get PDF
    The availability of the Information Technologies for everything, from everywhere, at all times is a growing requirement. We use information Technologies from common and social tasks to critical tasks like managing nuclear power plants or even the International Space Station (ISS). However, the availability of IT infrastructures is still a huge challenge nowadays. In a quick look around news, we can find reports of corporate outage, affecting millions of users and impacting on the revenue and image of the companies. It is well known that, currently, computer system outages are more often due to software faults, than hardware faults. Several studies have reported that one of the causes of unplanned software outages is the software aging phenomenon. This term refers to the accumulation of errors, usually causing resource contention, during long running application executions, like web applications, which normally cause applications/systems to hang or crash. Gradual performance degradation could also accompany software aging phenomena. The software aging phenomena are often related to memory bloating/ leaks, unterminated threads, data corruption, unreleased file-locks or overruns. We can find several examples of software aging in the industry. The work presented in this thesis aims to offer a proactive and predictive software rejuvenation solution for Internet Services against software aging caused by resource exhaustion. To this end, we first present a threshold based proactive rejuvenation to avoid the consequences of software aging. This first approach has some limitations, but the most important of them it is the need to know a priori the resource or resources involved in the crash and the critical condition values. Moreover, we need some expertise to fix the threshold value to trigger the rejuvenation action. Due to these limitations, we have evaluated the use of Machine Learning to overcome the weaknesses of our first approach to obtain a proactive and predictive solution. Finally, the current and increasing tendency to use virtualization technologies to improve the resource utilization has made traditional data centers turn into virtualized data centers or platforms. We have used a Mathematical Programming approach to virtual machine allocation and migration to optimize the resources, accepting as many services as possible on the platform while at the same time, guaranteeing the availability (via our software rejuvenation proposal) of the services deployed against the software aging phenomena. The thesis is supported by an exhaustive experimental evaluation that proves the effectiveness and feasibility of our proposals for current systems

    Software Defined Networks based Smart Grid Communication: A Comprehensive Survey

    Get PDF
    The current power grid is no longer a feasible solution due to ever-increasing user demand of electricity, old infrastructure, and reliability issues and thus require transformation to a better grid a.k.a., smart grid (SG). The key features that distinguish SG from the conventional electrical power grid are its capability to perform two-way communication, demand side management, and real time pricing. Despite all these advantages that SG will bring, there are certain issues which are specific to SG communication system. For instance, network management of current SG systems is complex, time consuming, and done manually. Moreover, SG communication (SGC) system is built on different vendor specific devices and protocols. Therefore, the current SG systems are not protocol independent, thus leading to interoperability issue. Software defined network (SDN) has been proposed to monitor and manage the communication networks globally. This article serves as a comprehensive survey on SDN-based SGC. In this article, we first discuss taxonomy of advantages of SDNbased SGC.We then discuss SDN-based SGC architectures, along with case studies. Our article provides an in-depth discussion on routing schemes for SDN-based SGC. We also provide detailed survey of security and privacy schemes applied to SDN-based SGC. We furthermore present challenges, open issues, and future research directions related to SDN-based SGC.Comment: Accepte

    Khaos: Dynamically Optimizing Checkpointing for Dependable Distributed Stream Processing

    Full text link
    Distributed Stream Processing systems are becoming an increasingly essential part of Big Data processing platforms as users grow ever more reliant on their ability to provide fast access to new results. As such, making timely decisions based on these results is dependent on a system's ability to tolerate failure. Typically, these systems achieve fault tolerance and the ability to recover automatically from partial failures by implementing checkpoint and rollback recovery. However, owing to the statistical probability of partial failures occurring in these distributed environments and the variability of workloads upon which jobs are expected to operate, static configurations will often not meet Quality of Service constraints with low overhead. In this paper we present Khaos, a new approach which utilizes the parallel processing capabilities of virtual cloud automation technologies for the automatic runtime optimization of fault tolerance configurations in Distributed Stream Processing jobs. Our approach employs three subsequent phases which borrows from the principles of Chaos Engineering: establish the steady-state processing conditions, conduct experiments to better understand how the system performs under failure, and use this knowledge to continuously minimize Quality of Service violations. We implemented Khaos prototypically together with Apache Flink and demonstrate its usefulness experimentally

    Dynamic load balancing based on live migration of virtual machines: Security threats and effects

    Get PDF
    Live migration of virtual machines (VMs) is the process of transitioning a VM from one virtual machine monitor (VMM) to another without halting the guest operating system, often between distinct physical machines, has opened new opportunities in computing. It allows a clean separation between hardware and software, and facilitates fault management, load balancing, and low-level system maintenance. Implemented by several existing virtualization products, live migration also aids in aspects such as high availability services, transparent mobility and consolidated management. While virtualization and live migration enable important new functionality, the combination introduces novel security challenges. A virtual machine monitor that incorporates a vulnerable implementation of live migration functionality may expose both the guest and host operating system to attack and result in a compromise of integrity. Given the large and increasing market for virtualization technology, a comprehensive understanding of virtual machine migration security is essential. So the main idea behind this thesis is to create a test environment that is suitable for experimenting and analyzing the security implications in case of exploitation of Live Migration of Virtual Machines. Using Live VM migration for dynamic load balancing or scheduling, this study determines workload hotspots in physical environment and through use of effective Live Migration process; tries to carry out resource profiling. By carrying out effective profiling, this thesis research is able to determine how much of each resource needs to be allocated to a VM. To understand exactly why process migration would not work in such scenarios and better understand Live VM Migration, this thesis tries to provide requisite incites as to which model is most appropriate for automatic load balancing for virtual machine infrastructure based on resource consumption. The security implications of exploiting the process of migration may end in unexpected results or results that are not noticeable. The scope of this thesis research is identifying these results and the causes for them

    SHStream: Self-Healing Framework for HTTP Video-Streaming

    Get PDF
    HTTP video-streaming is leading delivery of video content over the Internet. This phenomenon is explained by the ubiquity of web browsers, the permeability of HTTP traffic and the recent video technologies around HTML5. However, the inclusion of multimedia requests imposes new requirements on web servers due to responses with lifespans that can reach dozens of minutes and timing requirements for data fragments transmitted during the response period. Consequently, web- servers require real-time performance control to avoid playback outages caused by overloading and performance anomalies. We present SHStream , a self-healing framework for web servers delivering video-streaming content that provides (1) load admit- tance to avoid server overloading; (2) prediction of performance anomalies using online data stream learning algorithms; (3) continuous evaluation and selection of the best algorithm for prediction; and (4) proactive recovery by migrating the server to other hosts using container-based virtualization techniques. Evaluation of our framework using several variants of Hoeffding trees and ensemble algorithms showed that with a small number of learning instances, it is possible to achieve approximately 98% of recall and 99% of precision for failure predictions. Additionally, proactive failover can be performed in less than 1 secon
    • …
    corecore