
    A Bag-of-Tasks Scheduler Tolerant to Temporal Failures in Clouds

    Cloud platforms have emerged as a prominent environment for executing high performance computing (HPC) applications, providing on-demand resources as well as scalability. They usually offer different classes of Virtual Machines (VMs) with different guarantees in terms of availability and volatility, provisioning the same resource through multiple pricing models. For instance, in the Amazon EC2 cloud, the user pays per hour for on-demand VMs, while spot VMs are unused instances available at a lower price. Despite the monetary advantages, a spot VM can be terminated, stopped, or hibernated by EC2 at any moment. Using both hibernation-prone spot VMs (for the sake of cost) and on-demand VMs, we propose in this paper a static scheduling for HPC applications composed of independent tasks (bag-of-tasks) with deadline constraints. If a spot VM hibernates and does not resume within a time that guarantees the application's deadline, a temporal failure takes place. Our scheduling thus aims at minimizing the monetary cost of bag-of-tasks applications in the EC2 cloud while respecting their deadlines and avoiding temporal failures. To this end, our algorithm statically creates two scheduling maps: (i) the first one contains, for each task, its starting time and the VM on which the task should execute (i.e., an available spot or on-demand VM with the current lowest price); (ii) the second one contains, for each task allocated to a spot VM in the first map, its starting time and the on-demand VM on which it should be executed to meet the application deadline and avoid temporal failures. The latter is used whenever the hibernation period of a spot VM exceeds a time limit. Performance results from simulation with task execution traces, the configuration of Amazon EC2 VM classes, and VM market history confirm the effectiveness of our scheduling and that it tolerates temporal failures.
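The two scheduling maps described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the `VM` class, the `build_maps` function, the greedy cheapest-first placement, and the rule of packing backup slots backward from the deadline are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    kind: str              # "spot" or "on-demand"
    price: float           # hypothetical price per time unit
    free_at: float = 0.0   # time at which the VM becomes idle


def build_maps(tasks, vms, deadline):
    """tasks maps a task name to its duration. Returns (primary, backup)."""
    by_name = {vm.name: vm for vm in vms}
    primary, backup = {}, {}

    # Primary map: cheapest VM (spot or on-demand) that still meets the deadline.
    for task, duration in tasks.items():
        candidates = [vm for vm in vms if vm.free_at + duration <= deadline]
        if not candidates:
            raise ValueError(f"no VM can run {task} before the deadline")
        vm = min(candidates, key=lambda v: v.price)
        primary[task] = (vm.name, vm.free_at)
        vm.free_at += duration

    # Backup map (assumed policy): every task placed on a spot VM gets a slot
    # on an on-demand VM, packed backward from the deadline so that the slot
    # starts as late as possible.
    latest_end = {vm.name: deadline for vm in vms if vm.kind == "on-demand"}
    for task, (vm_name, _start) in primary.items():
        if by_name[vm_name].kind != "spot":
            continue
        duration = tasks[task]
        options = [
            vm for vm in vms
            if vm.kind == "on-demand"
            and latest_end[vm.name] - duration >= vm.free_at
        ]
        if not options:
            raise ValueError(f"no on-demand backup slot for {task}")
        vm = min(options, key=lambda v: v.price)
        start = latest_end[vm.name] - duration
        backup[task] = (vm.name, start)
        latest_end[vm.name] = start

    return primary, backup
```

In this sketch the backup map is consulted only when a spot VM stays hibernated past the time limit; the prices and durations passed in are placeholders, not EC2 data.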

    A Survey of Fault-Tolerance Techniques for Embedded Systems from the Perspective of Power, Energy, and Thermal Issues

    Relentless technology scaling has provided a significant increase in processor performance, but it has also had adverse impacts on system reliability. In particular, technology scaling increases processor susceptibility to radiation-induced transient faults. Moreover, technology scaling with the discontinuation of Dennard scaling increases power densities, and thereby temperatures, on the chip. High temperature, in turn, accelerates transistor aging mechanisms, which may ultimately lead to permanent faults on the chip. To assure reliable system operation despite these concerns, fault-tolerance techniques have emerged. Specifically, fault-tolerance techniques employ some form of redundancy to satisfy specific reliability requirements. However, integrating fault-tolerance techniques into real-time embedded systems complicates preserving timing constraints. As a remedy, many task mapping/scheduling policies have been proposed that consider the integration of fault-tolerance techniques and enforce both timing and reliability guarantees for real-time embedded systems. More advanced techniques additionally aim at minimizing power and energy while satisfying timing and reliability constraints. Recently, some scheduling techniques have started to tackle a new challenge: the temperature increase induced by employing fault-tolerance techniques. These emerging techniques aim at satisfying temperature constraints in addition to timing and reliability constraints. This paper provides an in-depth survey of the emerging research efforts that exploit fault-tolerance techniques while considering timing, power/energy, and temperature from the real-time embedded systems’ design perspective. In particular, the task mapping/scheduling policies for fault-tolerant real-time embedded systems are reviewed and classified according to their goals and constraints. Moreover, the employed fault-tolerance techniques, application models, and hardware models are considered as additional dimensions of the presented classification. Lastly, this survey gives deep insights into the main achievements and shortcomings of the existing approaches and highlights the most promising ones.

    Special session: Operating systems under test: An overview of the significance of the operating system in the resiliency of the computing continuum

    The computing continuum is currently seeing growth in the number of devices with any degree of computational capability. Those devices may include a full stack, with an Operating System layer and an Application layer, or may be pure bare-metal solutions. In either case, the reliability of the full system stack has to be guaranteed. It is crucial to provide data regarding the impact of faults at all system stack levels and potential hardening solutions to design highly resilient systems. While most work concentrates on application reliability, the special session aims to provide a deep understanding of the impact on the reliability of an embedded system when faults in the hardware substrate of the system stack surface at the Operating System layer. For this reason, we will cover a comparison from an application perspective when hardware faults happen in bare metal vs. real-time OS vs. general-purpose OS. Then we will go deeper into FreeRTOS to evaluate the contribution of all parts of the OS. Finally, the Special Session will propose some hardening techniques at the Operating System level that exploit the scheduling capabilities.

    Fault Tolerant Scheduling of Precedence Task Graphs on Heterogeneous Platforms

    Fault tolerance and latency are important requirements in several applications that are time critical in nature: such applications require guarantees in terms of latency, even when processors are subject to failures. In this paper, we propose a fault-tolerant scheduling heuristic for mapping precedence task graphs on heterogeneous systems. Our approach is based on an active replication scheme capable of supporting ε arbitrary fail-silent (fail-stop) processor failures; hence valid results will be provided even if ε processors fail. We focus on a bi-criteria approach, where we aim at minimizing the latency given a fixed number of failures supported in the system, or the other way round. Major achievements include a low complexity and a drastic reduction of the number of additional communications induced by the replication mechanism. Experimental results demonstrate that our heuristics, despite their lower complexity, outperform their direct competitor, the FTBAR scheduling algorithm [8].
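The idea of active replication tolerating ε fail-stop processor failures can be sketched as follows. This is a simplified illustration, not the heuristic from the paper: the function name `replicate_schedule`, the greedy earliest-finish rule, the pessimistic ready time (a task waits for the latest replica of each predecessor), and the omission of communication costs are all assumptions.

```python
def replicate_schedule(order, preds, exec_time, n_procs, eps):
    """Place eps + 1 replicas of every task on distinct processors.

    order: task names in topological order
    preds: task name -> list of predecessor task names
    exec_time[task][p]: run time of the task on processor p
    Returns task name -> list of (processor, start, finish).
    """
    if eps + 1 > n_procs:
        raise ValueError("need at least eps + 1 processors")

    proc_free = [0.0] * n_procs
    placement = {}
    for task in order:
        # Pessimistic assumption: wait for the slowest replica of each predecessor.
        ready = max(
            (max(f for _, _, f in placement[p]) for p in preds.get(task, [])),
            default=0.0,
        )
        # Rank processors by the finish time they would give this task.
        finish_on = sorted(
            (max(ready, proc_free[p]) + exec_time[task][p], p)
            for p in range(n_procs)
        )
        placement[task] = []
        # Keep the eps + 1 best processors; they are distinct by construction.
        for finish, p in finish_on[: eps + 1]:
            start = finish - exec_time[task][p]
            placement[task].append((p, start, finish))
            proc_free[p] = finish
    return placement
```

Because each task has ε + 1 replicas on different processors, at least one replica of every task survives any ε processor failures, which is the property the abstract relies on.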

    Comparison of Enhancing Methods for Primary/Backup Approach Meant for Fault Tolerant Scheduling

    This report explores algorithms aimed at reducing the algorithm run-time and rejection rate when scheduling tasks online on real-time embedded systems consisting of several fault-prone processors. The authors introduce a new processor scheduling policy, propose new enhancing methods for the primary/backup approach, and analyse their performance. The studied techniques are as follows: (i) the method of restricted scheduling windows within which the primary and backup copies can be scheduled, (ii) the method of limitation on the number of comparisons, accounting for the algorithm run-time, when scheduling a task on a system, and (iii) the method of several scheduling attempts. In addition, we inject faults to evaluate the impact on scheduling algorithms. Thorough experiments show that the best proposed method is based on the combination of the limitation on the number of comparisons and two scheduling attempts. Compared to the primary/backup approach without this method, the algorithm run-time is reduced by 23% (mean value) and 67% (maximum value) and the rejection rate is decreased by 4%. This improvement in the algorithm run-time is significant, especially for embedded systems dealing with hard real-time tasks. Finally, we found that the studied algorithm performs well in a harsh environment.

    Comparison of Different Methods Making Use of Backup Copies for Fault-Tolerant Scheduling on Embedded Multiprocessor Systems

    As transistors scale down, systems become more vulnerable to faults. Their reliability consequently becomes the main concern, especially in safety-critical applications such as the automotive sector, aeronautics, or nuclear plants. Many methods have already been introduced to design fault-tolerant systems and thereby improve reliability. Nevertheless, several of them are not suitable for real-time embedded systems because they incur significant overheads, while other methods may be less intrusive but at the cost of being too specific to a dedicated system. The aim of this paper is to analyse a method that uses two task copies when scheduling tasks online on multiprocessor systems. This method can guarantee system reliability without causing too much overhead and without requiring any special hardware components. In addition, it remains general and thus applicable to a large number of systems. Finally, this paper studies two processor allocation policies: the exhaustive search and the first found solution search. It is shown that the exhaustive search is not necessary for efficient fault-tolerant scheduling and that the first found solution search significantly reduces the computational complexity, which is attractive for embedded systems.
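The difference between the two processor allocation policies mentioned here can be sketched as below. This is an illustrative simplification, not the paper's algorithm: the names `place_copy` and `schedule_task`, the rule that the backup copy starts only after the primary copy finishes, and the comparison counter are assumptions.

```python
def place_copy(free_at, earliest, duration, deadline, exclude=None,
               exhaustive=False):
    """Find a processor for one task copy.

    Returns (processor, start, comparisons); processor and start are None
    when no processor can finish the copy before the deadline.
    """
    best = None
    comparisons = 0
    for p, free in enumerate(free_at):
        if p == exclude:
            continue
        comparisons += 1
        start = max(free, earliest)
        if start + duration > deadline:
            continue
        if not exhaustive:
            return p, start, comparisons   # first found solution search
        if best is None or start < best[1]:
            best = (p, start)              # exhaustive search keeps the earliest start
    if best is None:
        return None, None, comparisons
    return best[0], best[1], comparisons


def schedule_task(free_at, arrival, duration, deadline, exhaustive=False):
    """Place a primary and a backup copy on two different processors.

    Returns the placement, or None when the task is rejected.
    """
    p, p_start, c1 = place_copy(free_at, arrival, duration, deadline,
                                exhaustive=exhaustive)
    if p is None:
        return None
    # Assumption: the backup copy runs after the primary copy has finished.
    b, b_start, c2 = place_copy(free_at, p_start + duration, duration,
                                deadline, exclude=p, exhaustive=exhaustive)
    if b is None:
        return None
    free_at[p] = p_start + duration
    free_at[b] = b_start + duration
    return {"primary": (p, p_start), "backup": (b, b_start),
            "comparisons": c1 + c2}
```

In this sketch the first found search stops at the first feasible processor and so performs fewer comparisons, whereas the exhaustive search inspects every processor to find the earliest start.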

    Contribution à l’ordonnancement dynamique, tolérant aux fautes, de tâches pour les systèmes embarqués temps-réel multiprocesseurs

    The thesis is concerned with online mapping and scheduling of tasks on multiprocessor embedded systems in order to improve their reliability subject to various constraints such as time or energy. To evaluate system performance, the number of rejected tasks, the algorithm complexity, and the resilience assessed by injecting faults are analysed. The research was applied to: (i) the primary/backup approach, a fault-tolerance technique based on two task copies, and (ii) scheduling algorithms for small satellites called CubeSats. The chief objective for the primary/backup approach is to analyse processor allocation strategies, devise novel enhancing scheduling methods, and choose one that significantly reduces the algorithm run-time without worsening system performance. Regarding CubeSats, the proposed idea is to gather all processors built into the satellite on one board and design scheduling algorithms that make CubeSats more robust to faults. Two real CubeSat scenarios are analysed, and it is found that there is no benefit in considering systems with more than six processors and that the presented algorithms perform well in a harsh environment and under energy constraints.

    An Efficient Uplink Multi-Connectivity Scheme for 5G mmWave Control Plane Applications

    Millimeter wave (mmWave) frequencies offer the potential for orders-of-magnitude increases in capacity for next-generation cellular systems. However, links in mmWave networks are susceptible to blockage and may suffer from rapid variations in quality. Connectivity to multiple cells, at mmWave and/or traditional frequencies, is considered essential for robust communication. One of the challenges in supporting multi-connectivity at mmWave frequencies is the requirement for the network to track the direction of each link in addition to its power and timing. To address this challenge, we implement a novel uplink measurement system that, with the help of a local coordinator operating in the legacy band, guarantees continuous monitoring of the channel propagation conditions and allows for the design of efficient control plane applications, including handover, beam tracking and initial access. We show that an uplink-based multi-connectivity approach enables less consuming, better performing, faster and more stable cell selection and scheduling decisions with respect to a traditional downlink-based standalone scheme. Moreover, we argue that the presented framework guarantees (i) efficient tracking of the user in the presence of the channel dynamics expected at mmWaves, and (ii) fast reaction to situations in which the primary propagation path is blocked or not available. Comment: Submitted for publication in IEEE Transactions on Wireless Communications (TWC).