A Bag-of-Tasks Scheduler Tolerant to Temporal Failures in Clouds
Cloud platforms have emerged as a prominent environment to execute
high-performance computing (HPC) applications, providing on-demand resources
as well as scalability. They usually offer different classes of Virtual
Machines (VMs) which provide different guarantees in terms of availability and
volatility, provisioning the same resource through multiple pricing models.
For instance, in the Amazon EC2 cloud, the user pays per hour for on-demand
VMs, while spot VMs are unused instances available at a lower price. Despite
the monetary advantage, a spot VM can be terminated, stopped, or hibernated by
EC2 at any moment.
Using both hibernation-prone spot VMs (for the sake of cost) and on-demand
VMs, we propose in this paper a static scheduling strategy for HPC
applications composed of independent tasks (bag-of-tasks) with deadline
constraints. If a spot VM hibernates and does not resume within a time that
guarantees the application's deadline, a temporal failure takes place. Our
scheduling thus aims at minimizing the monetary cost of bag-of-tasks
applications in the EC2 cloud, respecting their deadlines and avoiding
temporal failures. To this end, our algorithm statically creates two
scheduling maps: (i) the first one contains, for each task, its starting time
and the VM (i.e., an available spot or on-demand VM with the current lowest
price) on which the task should execute; (ii) the second one contains, for
each task allocated to a spot VM in the first map, its starting time and the
on-demand VM on which it should be executed to meet the application deadline
and avoid temporal failures. The latter map is used whenever the hibernation
period of a spot VM exceeds a time limit. Performance results from simulations
with task execution traces, the configuration of Amazon EC2 VM classes, and VM
market price history confirm the effectiveness of our scheduling and its
tolerance of temporal failures.
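The two-map idea can be sketched very roughly. The following is an illustrative simplification, not the authors' algorithm: one spot VM and one on-demand VM, unit-less hours, and hypothetical task durations; the backup map records, for each spot-scheduled task, the latest start time on an on-demand VM that still meets the deadline.

```python
# Hedged sketch of the two static scheduling maps (not the paper's algorithm):
# primary map prefers the cheap spot VM; backup map gives the latest on-demand
# start that still meets the deadline. All numbers are hypothetical.

def build_maps(tasks, deadline):
    """tasks: dict name -> duration (hours). Returns (primary, backup) maps."""
    primary, backup = {}, {}
    t_spot = t_od = 0.0          # next free time on the spot / on-demand VM
    for name, dur in sorted(tasks.items(), key=lambda kv: -kv[1]):
        if t_spot + dur <= deadline:                  # fits on the spot VM
            primary[name] = ("spot", t_spot)
            backup[name] = ("on-demand", deadline - dur)  # latest safe start
            t_spot += dur
        else:                                          # overflow to on-demand
            primary[name] = ("on-demand", t_od)
            t_od += dur
    return primary, backup

primary, backup = build_maps({"t1": 2.0, "t2": 1.0, "t3": 3.0}, deadline=6.0)
# the backup map is consulted only when a spot hibernation exceeds its slack
```

In this toy setting all three tasks fit on the spot VM, and each gets an on-demand fallback slot chosen so that even a late switch still meets the deadline.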
A Survey of Fault-Tolerance Techniques for Embedded Systems from the Perspective of Power, Energy, and Thermal Issues
The relentless technology scaling has provided a significant increase in processor performance but, on the other hand, has led to adverse impacts on system reliability. In particular, technology scaling increases the processor's susceptibility to radiation-induced transient faults. Moreover, technology scaling after the end of Dennard scaling increases power densities, and thereby temperatures, on the chip. High temperature, in turn, accelerates transistor aging mechanisms, which may ultimately lead to permanent faults on the chip. To assure reliable system operation despite these potential reliability concerns, fault-tolerance techniques have emerged. Specifically, fault-tolerance techniques employ some kind of redundancy to satisfy specific reliability requirements. However, the integration of fault-tolerance techniques into real-time embedded systems complicates preserving timing constraints. As a remedy, many task mapping/scheduling policies have been proposed to integrate fault-tolerance techniques and enforce both timing and reliability guarantees for real-time embedded systems. More advanced techniques additionally aim at minimizing power and energy while satisfying timing and reliability constraints. Recently, some scheduling techniques have started to tackle a new challenge: the temperature increase induced by employing fault-tolerance techniques. These emerging techniques aim at satisfying temperature constraints besides timing and reliability constraints. This paper provides an in-depth survey of the emerging research efforts that exploit fault-tolerance techniques while considering timing, power/energy, and temperature from the real-time embedded systems' design perspective. In particular, the task mapping/scheduling policies for fault-tolerant real-time embedded systems are reviewed and classified according to their considered goals and constraints.
Moreover, the employed fault-tolerance techniques, application models, and hardware models are treated as additional dimensions of the presented classification. Lastly, this survey gives deep insights into the main achievements and shortcomings of the existing approaches and highlights the most promising ones.
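The survey's central premise, that redundancy is spent to meet a reliability requirement, can be illustrated with a back-of-the-envelope calculation. Assuming independent transient faults and a hypothetical per-replica success probability r, n replicas succeed with probability R(n) = 1 - (1 - r)^n:

```python
# Toy illustration (not from the survey): how many replicas are needed to
# reach a reliability target, assuming independent transient faults and a
# hypothetical per-replica success probability r.

def replicas_needed(r, target):
    """Smallest replica count n with 1 - (1 - r)**n >= target."""
    n = 1
    while 1 - (1 - r) ** n < target:
        n += 1
    return n

assert replicas_needed(0.99, 0.99999) == 3   # 1 - 0.01**3 = 0.999999
```

This is exactly the trade-off the surveyed policies navigate: each extra replica raises reliability but also adds execution time, power, and heat.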
Special session: Operating systems under test: An overview of the significance of the operating system in the resiliency of the computing continuum
The current trend in the computing continuum is a growth in the number of devices with any degree of computational capability. Those devices may or may not include a full stack, comprising an Operating System layer and an Application layer, or may run pure bare-metal solutions. In either case, the reliability of the full system stack has to be guaranteed. It is crucial to provide data regarding the impact of faults at all levels of the system stack, and potential hardening solutions, in order to design highly resilient systems. While most work usually concentrates on application reliability, this special session aims to provide a deep understanding of the impact on the reliability of an embedded system when faults in the hardware substrate of the system stack surface at the Operating System layer. For this reason, we cover a comparison from an application perspective when hardware faults happen in bare metal vs. a real-time OS vs. a general-purpose OS. Then we go deeper within FreeRTOS to evaluate the contribution of all parts of the OS. Eventually, the special session proposes some hardening techniques at the Operating System level by exploiting the scheduling capabilities.
Fault Tolerant Scheduling of Precedence Task Graphs on Heterogeneous Platforms
Fault tolerance and latency are important requirements in several applications which are time-critical in nature: such applications require guarantees in terms of latency, even when processors are subject to failures. In this paper, we propose a fault-tolerant scheduling heuristic for mapping precedence task graphs on heterogeneous systems. Our approach is based on an active replication scheme, capable of supporting arbitrary fail-silent (fail-stop) processor failures, hence valid results will be provided even if processors fail. We focus on a bi-criteria approach, where we aim at minimizing the latency given a fixed number of failures supported in the system, or the other way round. Major achievements include a low complexity and a drastic reduction of the number of additional communications induced by the replication mechanism. Experimental results demonstrate that our heuristics, despite their lower complexity, outperform their direct competitor, the FTBAR scheduling algorithm [8].
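The core of active replication can be sketched minimally: every task is placed on eps+1 distinct processors, so any eps fail-stop failures leave at least one live copy. This is a hedged illustration under a naive round-robin placement, not the paper's latency-minimizing heuristic; the task list and processor count are hypothetical.

```python
# Minimal sketch of active replication for fail-stop failures (round-robin
# placement, NOT the paper's heuristic): eps+1 copies per task on distinct
# processors survive up to eps processor failures.

from itertools import count

def replicate(tasks_in_topo_order, n_procs, eps):
    """Map each task to eps+1 distinct processors."""
    assert eps + 1 <= n_procs, "need more processors than tolerated failures"
    proc = count()
    mapping = {}
    for t in tasks_in_topo_order:
        # consecutive counter values mod n_procs are distinct since eps+1 <= n_procs
        mapping[t] = [next(proc) % n_procs for _ in range(eps + 1)]
    return mapping

m = replicate(["a", "b", "c"], n_procs=4, eps=1)
```

The heuristic's actual contribution lies in where those copies go, choosing placements that keep the latency and the replication-induced communications low, which the round-robin sketch deliberately ignores.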
Comparison of Enhancing Methods for Primary/Backup Approach Meant for Fault Tolerant Scheduling
This report explores algorithms aiming at reducing the algorithm run-time and the rejection rate when scheduling tasks online on real-time embedded systems consisting of several processors prone to faults. The authors introduce a new processor scheduling policy and propose new enhancing methods for the primary/backup approach, and analyse their performance. The studied techniques are as follows: (i) the method of restricted scheduling windows within which the primary and backup copies can be scheduled, (ii) the method of limiting the number of comparisons carried out when scheduling a task on a system, which accounts for the algorithm run-time, and (iii) the method of several scheduling attempts. Last but not least, we inject faults to evaluate their impact on the scheduling algorithms. Thorough experiments show that the best proposed method combines the limitation on the number of comparisons with two scheduling attempts. Compared to the primary/backup approach without this method, the algorithm run-time is reduced by 23% (mean value) and 67% (maximum value) and the rejection rate is decreased by 4%. This improvement in the algorithm run-time is significant, especially for embedded systems dealing with hard real-time tasks. Finally, we found that the studied algorithm performs well in a harsh environment.
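The primary/backup approach with a cap on comparisons can be sketched as follows. This is a hedged toy model, not the report's algorithm: unit time slots, one primary placed as early as possible and one backup copy on a different processor as late as possible, with at most max_cmp primary placements examined before the task is rejected.

```python
# Toy primary/backup placement with a comparison budget (not the report's
# exact method): the slot model, counting convention and parameters are
# hypothetical simplifications.

def try_schedule(busy, n_procs, release, wcet, deadline, max_cmp):
    """busy: set of (proc, slot). Returns ((p, start), (q, bstart)) or None."""
    cmp_done = 0
    for p in range(n_procs):
        for start in range(release, deadline - wcet + 1):
            cmp_done += 1
            if cmp_done > max_cmp:
                return None                    # run-time cap hit -> reject task
            slots = {(p, s) for s in range(start, start + wcet)}
            if slots & busy:
                continue                       # primary window occupied
            # backup: different processor, scheduled as late as possible
            for q in range(n_procs):
                if q == p:
                    continue
                for bstart in range(deadline - wcet, release - 1, -1):
                    bslots = {(q, s) for s in range(bstart, bstart + wcet)}
                    if not (bslots & busy):
                        return (p, start), (q, bstart)
    return None
```

Lowering max_cmp trades rejection rate for run-time, which is exactly the balance the studied enhancing methods tune.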
Comparison of Different Methods Making Use of Backup Copies for Fault-Tolerant Scheduling on Embedded Multiprocessor Systems
As transistors scale down, systems become more vulnerable to faults. Their reliability consequently becomes the main concern, especially in safety-critical applications such as the automotive sector, aeronautics or nuclear plants. Many methods have already been introduced to conceive fault-tolerant systems and therefore improve reliability. Nevertheless, several of them are not suitable for real-time embedded systems since they incur significant overheads; other methods may be less intrusive, but at the cost of being too specific to a dedicated system. The aim of this paper is to analyse a method making use of two task copies when scheduling tasks online on multiprocessor systems. This method can guarantee system reliability without causing too much overhead or requiring any special hardware components. In addition, it remains general and thus applicable to a large number of systems. Last but not least, this paper studies two processor allocation policies: the exhaustive search and the first-found-solution search. It is shown that the exhaustive search is not necessary for efficient fault-tolerant scheduling and that the latter search significantly reduces the computation complexity, which is of interest for embedded systems.
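The two allocation policies compared above can be contrasted in a few lines. This sketch uses a toy load model (a processor is feasible while its summed utilisation stays at or below 1.0); the functions and numbers are hypothetical, not the paper's implementation.

```python
# Toy contrast of the two processor allocation policies (hypothetical load
# model: feasible while total utilisation <= 1.0).

def exhaustive(loads, u):
    """Scan all processors; pick the least loaded one that can take u."""
    best = min(range(len(loads)), key=lambda p: loads[p])
    return best if loads[best] + u <= 1.0 else None

def first_found(loads, u):
    """Stop at the first processor that can take u."""
    for p, load in enumerate(loads):
        if load + u <= 1.0:
            return p
    return None

loads = [0.9, 0.4, 0.2]
assert first_found(loads, 0.3) == 1   # stops at processor 1
assert exhaustive(loads, 0.3) == 2    # scans all, picks processor 2
```

Both return a feasible processor; first-found simply spends less time deciding, which matches the paper's conclusion that the exhaustive scan is not necessary for efficient fault-tolerant scheduling.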
Contribution to Dynamic Fault-Tolerant Task Scheduling for Real-Time Multiprocessor Embedded Systems
This thesis is concerned with online mapping and scheduling of tasks on multiprocessor embedded systems in order to improve reliability subject to various constraints regarding e.g. time or energy. To evaluate system performance, the number of rejected tasks, the algorithm complexity and the resilience, assessed by injecting faults, are analysed. The research was applied to: (i) the primary/backup approach, a fault-tolerance technique based on two task copies, and (ii) scheduling algorithms for small satellites called CubeSats. The chief objective for the primary/backup approach is to analyse processor allocation strategies, devise novel enhancing scheduling methods and choose one which significantly reduces the algorithm run-time without worsening system performance. Regarding CubeSats, the proposed idea is to gather all processors built into a satellite on one board and design scheduling algorithms to make CubeSats more robust to faults. Two real CubeSat scenarios are analysed, and it is found that it is useless to consider systems with more than six processors and that the presented algorithms perform well in a harsh environment and under energy constraints.
An Efficient Uplink Multi-Connectivity Scheme for 5G mmWave Control Plane Applications
The millimeter wave (mmWave) frequencies offer the potential of orders of
magnitude increases in capacity for next-generation cellular systems. However,
links in mmWave networks are susceptible to blockage and may suffer from rapid
variations in quality. Connectivity to multiple cells - at mmWave and/or
traditional frequencies - is considered essential for robust communication. One
of the challenges in supporting multi-connectivity in mmWaves is the
requirement for the network to track the direction of each link in addition to
its power and timing. To address this challenge, we implement a novel uplink
measurement system that, with the joint help of a local coordinator operating
in the legacy band, guarantees continuous monitoring of the channel propagation
conditions and allows for the design of efficient control plane applications,
including handover, beam tracking and initial access. We show that an
uplink-based multi-connectivity approach enables less energy-consuming, better
performing, faster and more stable cell selection and scheduling decisions with
respect to a traditional downlink-based standalone scheme. Moreover, we argue
that the presented framework guarantees (i) efficient tracking of the user in
the presence of the channel dynamics expected at mmWaves, and (ii) fast
reaction to situations in which the primary propagation path is blocked or not
available.
Comment: Submitted for publication in IEEE Transactions on Wireless
Communications (TWC).
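The cell-selection logic such an uplink measurement system enables can be sketched as follows. This is a hypothetical illustration, not the paper's scheme: each cell reports the SINR it measured on the user's uplink signal to the legacy-band coordinator, and the thresholds, report format and cell names are all assumptions.

```python
# Hypothetical uplink-report-driven cell selection (not the paper's scheme):
# switch only on a clear gain (hysteresis), but re-steer immediately if the
# serving path looks blocked. Thresholds in dB are made-up values.

def select_cell(reports, current, hysteresis_db=3.0, blockage_db=-5.0):
    """reports: dict cell_id -> uplink SINR in dB. Returns the serving cell."""
    best = max(reports, key=reports.get)
    if reports.get(current, blockage_db - 1) <= blockage_db:
        return best                    # serving path blocked: fast re-steer
    if reports[best] > reports[current] + hysteresis_db:
        return best                    # switch only on a clear improvement
    return current                     # otherwise stay (stability)

serving = select_cell({"mmw1": 12.0, "mmw2": 20.0, "lte1": 8.0}, current="mmw1")
```

The hysteresis term captures the claimed stability of uplink-based decisions, while the blockage branch captures the fast reaction when the primary propagation path disappears.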
Towards a Fault-tolerant, Scheduling Methodology for Safety-critical Certified Information Systems
Today, many critical information systems have safety-critical and non-safety-critical functions executed on the same platform in order to reduce design and implementation costs. The safety-critical functionality is subject to certification requirements, while the rest of the functionality does not need to be certified, or is certified to a lower level. The resulting mixed-criticality systems bring challenges in design, especially when the critical tasks are required to complete within a timing constraint. This paper studies the problem of scheduling a mixed-criticality system with fault tolerance. A fault-recovery technique called checkpointing is used, whereby a program can go back to a recent checkpoint for re-execution when errors occur. A novel schedulability test is derived to ensure that the safety-critical tasks complete before their deadlines, and its theoretical correctness is shown.
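The trade-off underlying such a checkpointing-based schedulability test can be shown with a standard back-of-the-envelope model (an assumption here, not the paper's exact test): k checkpoints add k*o of overhead but cap each of f re-executions at one segment of length C/(k+1), plus one overhead o per recovery, giving a worst case W(k) = C + k*o + f*(C/(k+1) + o).

```python
# Hedged illustration of the checkpointing trade-off (a textbook model, not
# the paper's schedulability test): all parameters below are hypothetical.
#   C: fault-free execution time, o: checkpoint/recovery overhead,
#   k: number of checkpoints, f: faults tolerated.

def worst_case(C, o, k, f):
    """Worst-case response time with k checkpoints under f faults."""
    return C + k * o + f * (C / (k + 1) + o)

def best_k(C, o, f, k_max=50):
    """Checkpoint count minimising the worst-case response time."""
    return min(range(k_max + 1), key=lambda k: worst_case(C, o, k, f))

k = best_k(C=10.0, o=0.5, f=2)
# a task would then be deemed schedulable if worst_case(C, o, k, f) fits
# before its deadline, which is the shape of test the paper derives
```

More checkpoints shrink the re-execution penalty but inflate the fault-free cost, so the test has to be evaluated at a well-chosen k.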