
    Fault recovery mechanisms in utility accrual real time scheduling algorithm

    In this paper, we propose two recovery solutions built on an existing error-free utility accrual scheduling algorithm, the General Utility Accrual Scheduling algorithm (GUS) (Peng Li, 2004). A robust fault recovery algorithm, Backward Recovery GUS (BRGUS), adopts the time-redundancy model: it re-executes an affected task after its transient error period is over. BRGUS is compared with a simpler recovery algorithm, Abortion Recovery GUS (ARGUS), which simply aborts all faulty tasks. Our main objectives are (1) to maximize the total accrued utility and (2) to ensure the correctness of executed tasks on a best-effort basis, keeping as many tasks fault-free as possible. Our simulation results reveal that BRGUS outperforms ARGUS, with higher accrued utility and a lower abortion ratio, making it more suitable and efficient for adaptive real-time systems.
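
    To make the contrast concrete, here is a minimal Python sketch of the two recovery policies under a simplified task model; the `Task` fields, time units, and function names are illustrative assumptions and omit GUS's actual utility-accrual admission tests.

```python
# Hypothetical illustration of the two recovery policies; not the paper's model.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    utility: float            # utility accrued if the task completes correctly
    remaining: float          # remaining execution time
    error_until: float = 0.0  # end of the transient-error period (0 = healthy)

def argus_recover(task, now):
    """Abortion Recovery GUS (ARGUS): simply abort any faulty task."""
    return None  # task dropped, its utility is lost

def brgus_recover(task, now):
    """Backward Recovery GUS (BRGUS): re-execute the affected task once its
    transient error period is over (time-redundancy model)."""
    restart_at = max(now, task.error_until)
    return restart_at, Task(task.name, task.utility, task.remaining)

# Example: a faulty task whose transient error clears at t = 12
t = Task("sensor_fusion", utility=10.0, remaining=3.0, error_until=12.0)
print(argus_recover(t, now=10.0))  # None -> no utility accrued
print(brgus_recover(t, now=10.0))  # (12.0, fresh copy to re-execute)
```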

    Adaptive Mid-term and Short-term Scheduling of Mixed-criticality Systems

    A mixed-criticality real-time system is a real-time system whose tasks are classified according to their criticality. Research on mixed-criticality systems started in order to provide an effective and cost-efficient a priori verification process for safety-critical systems. The higher the criticality of a task within a system, the more the system should guarantee its required level of service. However, this model poses new challenges for scheduling and fault tolerance within real-time systems. Current mixed-criticality scheduling protocols severely degrade lower-criticality tasks in case of resource shortage in order to provide the required level of service for the most critical ones. The open research challenge in this field is to devise robust scheduling protocols that minimise the impact on less critical tasks. This dissertation introduces two approaches, one short-term and the other medium-term, to appropriately allocate computing resources to tasks within mixed-criticality systems on both uniprocessor and multiprocessor platforms. The short-term strategy is a protocol named Lazy Bailout Protocol (LBP) for scheduling mixed-criticality task sets on single-core architectures. Scheduling decisions are made about tasks that are active in the ready queue and have to be dispatched to the CPU. LBP minimises the service degradation of lower-criticality tasks by giving them background execution during system idle time. I then refined LBP with variants that further increase the service level provided to lower-criticality tasks, at the cost of either additional offline analysis or increased runtime complexity. The second approach, the Adaptive Tolerance-based Mixed-criticality Protocol (ATMP), decides at runtime which tasks are allocated to the active cores according to the available resources. ATMP optimises the overall system utility by tuning the system workload when computing capacity runs short at runtime. Unlike the majority of current mixed-criticality approaches, ATMP also allows higher-criticality tasks to be degraded smoothly in order to keep lower-criticality tasks allocated.
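
    The short-term dispatch idea behind LBP can be pictured with a rough Python sketch: lower-criticality jobs that would otherwise be dropped are kept on a background queue and are only dispatched when no guaranteed work is ready. The queue names and the dispatch rule are simplifying assumptions; LBP's actual bailout accounting is not reproduced here.

```python
# Illustrative dispatcher only; LBP's bailout fund and mode changes are omitted.
import heapq

def dispatch(ready_guaranteed, background_low_crit):
    """ready_guaranteed: heap of (deadline, job) for jobs kept at full guarantee.
    background_low_crit: FIFO of lower-criticality jobs demoted to background.
    Guaranteed work always wins; idle time is used for background jobs,
    which is how LBP limits the service degradation of low-criticality tasks."""
    if ready_guaranteed:
        return heapq.heappop(ready_guaranteed)[1]
    if background_low_crit:
        return background_low_crit.pop(0)
    return None  # CPU idle

hi = [(10, "flight_ctrl_job"), (15, "nav_job")]
heapq.heapify(hi)
lo = ["logging_job"]
print(dispatch(hi, lo))  # 'flight_ctrl_job' (earliest-deadline guaranteed job)
```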

    Lifeguard: Local Health Awareness for More Accurate Failure Detection

    SWIM is a peer-to-peer group membership protocol with attractive scaling and robustness properties. However, slow message processing can cause SWIM to mark healthy members as failed (so-called false positive failure detection), despite the inclusion of a mechanism intended to avoid this. We identify the properties of SWIM that lead to the problem and propose Lifeguard, a set of extensions to SWIM that account for the possibility that the local failure detector module itself is at fault, via the concept of local health. We evaluate this approach in a precisely controlled environment and validate it in a real-world scenario, showing that it drastically reduces the rate of false positives. The false positive rate and the detection time for true failures can be reduced simultaneously, compared to the baseline levels of SWIM.
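
    The local-health idea can be sketched as a saturating counter that widens this node's probe timeouts when its own message processing looks slow; the constants and scaling rule below are illustrative assumptions rather than Lifeguard's exact parameters.

```python
# Illustrative local-health multiplier; constants are assumptions, not Lifeguard's.
class LocalHealth:
    def __init__(self, base_timeout=0.5, max_score=8):
        self.base_timeout = base_timeout  # seconds
        self.max_score = max_score
        self.score = 0  # 0 = healthy; higher = this node may itself be slow

    def on_probe_success(self):
        self.score = max(0, self.score - 1)

    def on_missed_ack(self):
        # A missed ack may mean *we* are slow, not that the peer has failed.
        self.score = min(self.max_score, self.score + 1)

    def probe_timeout(self):
        # Suspected local slowness -> longer timeouts -> fewer false positives.
        return self.base_timeout * (1 + self.score)

lh = LocalHealth()
lh.on_missed_ack()
print(lh.probe_timeout())  # 1.0 s instead of 0.5 s after one missed ack
```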

    Fourth Annual Workshop on Space Operations Applications and Research (SOAR 90)

    The proceedings of the SOAR workshop are presented. The technical areas covered are as follows: Automation and Robotics; Environmental Interactions; Human Factors; Intelligent Systems; and Life Sciences. NASA and Air Force programmatic overviews and panel sessions were also held in each technical area.

    Adaptive Failure-Aware Scheduling for Hadoop

    Given the dynamic nature of cloud environments, failures are the norm rather than the exception in the data centers powering cloud frameworks. Despite the diversity of recovery mechanisms integrated into cloud frameworks, their schedulers still make poor scheduling decisions that lead to task failures caused by unforeseen events such as unpredicted service demand or hardware outages. Traditionally, simulation and analytical modeling have been widely used to analyze the impact of scheduling decisions on failure rates. However, they cannot provide accurate results and exhaustive coverage of cloud systems, especially when failures occur. In this thesis, we present new approaches for modeling and verifying an adaptive failure-aware scheduling algorithm for Hadoop that detects these failures early and reschedules tasks according to changes in the cloud. Hadoop is the framework of choice on many off-the-shelf clusters in the cloud for processing data-intensive applications by efficiently running them across multiple distributed machines. The proposed scheduling algorithm for Hadoop relies on predictions made by machine learning algorithms trained on previously executed tasks and on data collected from the Hadoop environment. To further improve Hadoop's scheduling decisions on the fly, we use reinforcement learning techniques to select an appropriate scheduling action for each scheduled task. Furthermore, we propose an adaptive algorithm to dynamically detect node failures in Hadoop. We implement these approaches in ATLAS, an AdapTive Failure-Aware Scheduling algorithm that can be built on top of existing Hadoop schedulers. To illustrate the usefulness and benefits of ATLAS, we conduct a large empirical study on a Hadoop cluster deployed on Amazon Elastic MapReduce (EMR), comparing the performance of ATLAS to that of three Hadoop scheduling algorithms (FIFO, Fair, and Capacity). Results show that ATLAS outperforms these scheduling algorithms in terms of failure rates, execution times, and resource utilization. Finally, we propose a new methodology to formally identify the impact of Hadoop's scheduling decisions on failure rates. We use model checking to verify some of the most important scheduling properties in Hadoop (schedulability, resource-deadlock freeness, and fairness) and provide possible strategies to avoid violations of these properties in ATLAS. The formal verification of the Hadoop scheduler makes it possible to identify more task failures and hence reduce the number of failures in ATLAS.
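
    A toy Python sketch of the failure-aware placement step: a model trained on previously executed tasks estimates a failure probability for each candidate node, and the scheduler places or defers the task accordingly. The feature layout, threshold, and scikit-learn-style `predict_proba` call are hypothetical stand-ins, not ATLAS's actual pipeline.

```python
# Hypothetical illustration of failure-aware placement; not ATLAS's implementation.
def predict_failure(task_features, node_features, model):
    """Estimated probability that the task fails on this node, from a model
    trained on features of previously executed tasks."""
    return model.predict_proba([task_features + node_features])[0][1]

def place_task(task, nodes, model, threshold=0.3):
    """Pick the node with the lowest predicted failure probability; if every
    candidate exceeds the threshold, return None so the task is rescheduled."""
    scored = [(predict_failure(task["features"], n["features"], model), n)
              for n in nodes]
    prob, best = min(scored, key=lambda pair: pair[0])
    return best if prob < threshold else None
```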

    Tardiness Bounds and Overload in Soft Real-Time Systems

    In some systems, such as future generations of unmanned aerial vehicles (UAVs), different software running on the same machine will require different timing guarantees. For example, flight control software has hard real-time (HRT) requirements: if a job (i.e., an invocation of a program) completes late, safety may be compromised, so jobs must be guaranteed to complete within short deadlines. However, mission control software is likely to have soft real-time (SRT) requirements: if a job completes slightly late, the result is not likely to be catastrophic, but lateness should never be unbounded. The global earliest-deadline-first (G-EDF) scheduler has been demonstrated to be useful for the multiprocessor scheduling of software with SRT requirements, and the multicore mixed-criticality (MC2) framework, which uses G-EDF for SRT scheduling, has been proposed to safely mix HRT and SRT work on multicore UAV platforms. This dissertation addresses limitations of that prior work. G-EDF is attractive for SRT systems because it allows the system to be fully utilized with reasonable overheads. Furthermore, previous analysis of G-EDF can provide "lateness bounds" on the amount of time between a job's deadline and its completion. However, smaller lateness bounds are preferable, and some programs may be more sensitive to lateness than others. In this dissertation, we explore the broader category of G-EDF-like (GEL) schedulers that have overhead characteristics identical to those of G-EDF. We show that by choosing GEL schedulers other than G-EDF, better lateness bounds can be achieved, and that certain modifications can further improve lateness bounds while maintaining reasonable overheads. Specifically, successive jobs from the same program can be permitted to run in parallel with each other, or jobs can be split into smaller pieces by the operating system. Previous analysis of MC2 has always used less pessimistic execution time assumptions when analyzing SRT work than when analyzing HRT work. These assumptions can be violated, creating an overload that causes SRT guarantees to be violated. Furthermore, even in the expected case that such violations are transient, the system is not guaranteed to return to its normal operation. In this dissertation, we also provide a mechanism for such recovery.
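
    For reference, the notions used above can be written compactly; the notation follows common GEL-scheduler formulations and is a paraphrase, not a formula taken from this dissertation.

```latex
% Job J_{i,k} of task \tau_i is released at r_{i,k} with relative deadline D_i.
% A GEL scheduler orders jobs by an earliest "priority point"
\[
  y_{i,k} = r_{i,k} + Y_i ,
\]
% where Y_i is a per-task constant; G-EDF is the special case Y_i = D_i.
% If the job completes at time f_{i,k}, its lateness is
\[
  L_{i,k} = f_{i,k} - (r_{i,k} + D_i),
\]
% and a lateness bound for \tau_i is a constant B_i with L_{i,k} \le B_i for all k.
```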

    Service Quality Assessment for Cloud-based Distributed Data Services

    The issue of less-than-100% reliability and trustworthiness of third-party-controlled cloud components (e.g., IaaS and SaaS components from different vendors) may lead to laxity in the QoS guarantees offered by a service-support system S to various applications. An example of S is a replicated data service that handles customer queries with fault-tolerance and performance goals. QoS laxity (i.e., SLA violations) may be inadvertent: say, due to the inability of system designers to model the impact of sub-system behaviors on a deliverable QoS. Sometimes, QoS laxity may even be intentional: say, to reap revenue-oriented benefits by cheating on resource allocations and/or excessive statistical sharing of system resources (e.g., VM cycles, number of servers). Our goal is to assess how well the internal mechanisms of S are geared to offer a required level of service to the applications. We use computational models of S to determine the optimal feasible resource schedules and verify how close the actual system behavior is to a model-computed 'gold standard'. Our QoS assessment methods allow comparing different service vendors (possibly with different business policies) in terms of canonical properties such as elasticity, linearity, isolation, and fairness (analogous to a comparative rating of restaurants). Case studies of cloud-based distributed applications are described to illustrate our QoS assessment methods. The specific systems studied in the thesis are: i) replicated data services where the servers may be hosted on multiple data centers for fault-tolerance and performance reasons; and ii) content delivery networks for geographically distributed clients where the content data caches may reside on different data centers. The methods studied in the thesis are useful in various contexts of QoS management and self-configuration in large-scale cloud-based distributed systems that are inherently complex due to their size, diversity, and environment dynamicity.
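
    One way to picture the assessment step in Python: compare observed per-epoch service levels against a model-computed 'gold standard' and report a compliance score. The metric below is a deliberately crude stand-in for the thesis's computational models of S.

```python
# Illustrative only; the thesis's models of S are far richer than this ratio.
def compliance(observed, gold_standard):
    """Fraction of epochs in which the measured service level met at least
    the model-computed feasible level (1.0 = no detected QoS laxity)."""
    assert len(observed) == len(gold_standard)
    met = sum(1 for o, g in zip(observed, gold_standard) if o >= g)
    return met / len(observed)

# Example: three epochs of observed throughput vs. the model's feasible level
print(compliance([95, 88, 97], [90, 90, 90]))  # 0.666... -> possible SLA laxity
```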

    Air Force Institute of Technology Contributions to Air Force Research and Development, Calendar Year 1987

    From the introduction: The primary mission of the Air Force Institute of Technology (AFIT) is education, but research and consulting are essential integral elements in the process. This report highlights AFIT's contributions to Air Force research and development activities [in 1987].