A Bag-of-Tasks Scheduler Tolerant to Temporal Failures in Clouds
Cloud platforms have emerged as a prominent environment to execute
high-performance computing (HPC) applications, providing on-demand resources
as well as scalability. They usually offer different classes of Virtual
Machines (VMs) which provide different guarantees in terms of availability and
volatility, provisioning the same resource through multiple pricing models.
For instance, in the Amazon EC2 cloud, the user pays per hour for on-demand
VMs, while spot VMs are unused instances available at a lower price. Despite
the monetary advantage, a spot VM can be terminated, stopped, or hibernated by
EC2 at any moment.
Using both hibernation-prone spot VMs (for the sake of cost) and on-demand
VMs, we propose in this paper a static scheduling strategy for HPC
applications composed of independent tasks (bag-of-tasks) with deadline
constraints. If a spot VM hibernates and does not resume within a time that
guarantees the application's deadline, a temporal failure takes place. Our
scheduling thus aims at minimizing the monetary cost of bag-of-tasks
applications in the EC2 cloud, respecting their deadlines and avoiding
temporal failures. To this end, our algorithm statically creates two
scheduling maps: (i) the first one contains, for each task, its starting time
and the VM (i.e., an available spot or on-demand VM with the current lowest
price) on which the task should execute; (ii) the second one contains, for
each task allocated to a spot VM in the first map, its starting time and the
on-demand VM on which it should be executed to meet the application deadline
and avoid temporal failures. The latter map is used whenever the hibernation
period of a spot VM exceeds a time limit. Performance results from simulations
with task execution traces, the configuration of Amazon EC2 VM classes, and VM
market price history confirm the effectiveness of our scheduling and its
tolerance of temporal failures.
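The two-map idea can be sketched very roughly. The following is an illustrative simplification, not the authors' algorithm: one spot VM and one on-demand VM, unit-less hours, and hypothetical task durations; the backup map records, for each spot-scheduled task, the latest start time on an on-demand VM that still meets the deadline.

```python
# Hedged sketch of the two static scheduling maps (not the paper's algorithm):
# primary map prefers the cheap spot VM; backup map gives the latest on-demand
# start that still meets the deadline. All numbers are hypothetical.

def build_maps(tasks, deadline):
    """tasks: dict name -> duration (hours). Returns (primary, backup) maps."""
    primary, backup = {}, {}
    t_spot = t_od = 0.0          # next free time on the spot / on-demand VM
    for name, dur in sorted(tasks.items(), key=lambda kv: -kv[1]):
        if t_spot + dur <= deadline:                  # fits on the spot VM
            primary[name] = ("spot", t_spot)
            backup[name] = ("on-demand", deadline - dur)  # latest safe start
            t_spot += dur
        else:                                          # overflow to on-demand
            primary[name] = ("on-demand", t_od)
            t_od += dur
    return primary, backup

primary, backup = build_maps({"t1": 2.0, "t2": 1.0, "t3": 3.0}, deadline=6.0)
# the backup map is consulted only when a spot hibernation exceeds its slack
```

In this toy setting all three tasks fit on the spot VM, and each gets an on-demand fallback slot chosen so that even a late switch still meets the deadline.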
A Survey of Fault-Tolerance Techniques for Embedded Systems from the Perspective of Power, Energy, and Thermal Issues
The relentless technology scaling has provided a significant increase in processor performance but, on the other hand, has led to adverse impacts on system reliability. In particular, technology scaling increases the processor's susceptibility to radiation-induced transient faults. Moreover, technology scaling after the end of Dennard scaling increases power densities, and thereby temperatures, on the chip. High temperature, in turn, accelerates transistor aging mechanisms, which may ultimately lead to permanent faults on the chip. To assure reliable system operation despite these potential reliability concerns, fault-tolerance techniques have emerged. Specifically, fault-tolerance techniques employ some kind of redundancy to satisfy specific reliability requirements. However, the integration of fault-tolerance techniques into real-time embedded systems complicates preserving timing constraints. As a remedy, many task mapping/scheduling policies have been proposed to integrate fault-tolerance techniques and enforce both timing and reliability guarantees for real-time embedded systems. More advanced techniques additionally aim at minimizing power and energy while satisfying timing and reliability constraints. Recently, some scheduling techniques have started to tackle a new challenge: the temperature increase induced by employing fault-tolerance techniques. These emerging techniques aim at satisfying temperature constraints besides timing and reliability constraints. This paper provides an in-depth survey of the emerging research efforts that exploit fault-tolerance techniques while considering timing, power/energy, and temperature from the real-time embedded systems' design perspective. In particular, the task mapping/scheduling policies for fault-tolerant real-time embedded systems are reviewed and classified according to their considered goals and constraints.
Moreover, the employed fault-tolerance techniques, application models, and hardware models are treated as additional dimensions of the presented classification. Lastly, this survey gives deep insights into the main achievements and shortcomings of the existing approaches and highlights the most promising ones.
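The survey's central premise, that redundancy is spent to meet a reliability requirement, can be illustrated with a back-of-the-envelope calculation. Assuming independent transient faults and a hypothetical per-replica success probability r, n replicas succeed with probability R(n) = 1 - (1 - r)^n:

```python
# Toy illustration (not from the survey): how many replicas are needed to
# reach a reliability target, assuming independent transient faults and a
# hypothetical per-replica success probability r.

def replicas_needed(r, target):
    """Smallest replica count n with 1 - (1 - r)**n >= target."""
    n = 1
    while 1 - (1 - r) ** n < target:
        n += 1
    return n

assert replicas_needed(0.99, 0.99999) == 3   # 1 - 0.01**3 = 0.999999
```

This is exactly the trade-off the surveyed policies navigate: each extra replica raises reliability but also adds execution time, power, and heat.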
Special session: Operating systems under test: An overview of the significance of the operating system in the resiliency of the computing continuum
The current trend in the computing continuum is a growth in the number of devices with any degree of computational capability. Those devices may or may not include a full stack, comprising an Operating System layer and an Application layer, or may run pure bare-metal solutions. In either case, the reliability of the full system stack has to be guaranteed. It is crucial to provide data regarding the impact of faults at all levels of the system stack, and potential hardening solutions, in order to design highly resilient systems. While most work usually concentrates on application reliability, this special session aims to provide a deep understanding of the impact on the reliability of an embedded system when faults in the hardware substrate of the system stack surface at the Operating System layer. For this reason, we cover a comparison from an application perspective when hardware faults happen in bare metal vs. a real-time OS vs. a general-purpose OS. Then we go deeper within FreeRTOS to evaluate the contribution of all parts of the OS. Eventually, the special session proposes some hardening techniques at the Operating System level by exploiting the scheduling capabilities.
Fault Tolerant Scheduling of Precedence Task Graphs on Heterogeneous Platforms
Fault tolerance and latency are important requirements in several applications which are time-critical in nature: such applications require guarantees in terms of latency, even when processors are subject to failures. In this paper, we propose a fault-tolerant scheduling heuristic for mapping precedence task graphs on heterogeneous systems. Our approach is based on an active replication scheme, capable of supporting arbitrary fail-silent (fail-stop) processor failures, hence valid results will be provided even if processors fail. We focus on a bi-criteria approach, where we aim at minimizing the latency given a fixed number of failures supported in the system, or the other way round. Major achievements include a low complexity and a drastic reduction of the number of additional communications induced by the replication mechanism. Experimental results demonstrate that our heuristics, despite their lower complexity, outperform their direct competitor, the FTBAR scheduling algorithm [8].
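The core of active replication can be sketched minimally: every task is placed on eps+1 distinct processors, so any eps fail-stop failures leave at least one live copy. This is a hedged illustration under a naive round-robin placement, not the paper's latency-minimizing heuristic; the task list and processor count are hypothetical.

```python
# Minimal sketch of active replication for fail-stop failures (round-robin
# placement, NOT the paper's heuristic): eps+1 copies per task on distinct
# processors survive up to eps processor failures.

from itertools import count

def replicate(tasks_in_topo_order, n_procs, eps):
    """Map each task to eps+1 distinct processors."""
    assert eps + 1 <= n_procs, "need more processors than tolerated failures"
    proc = count()
    mapping = {}
    for t in tasks_in_topo_order:
        # consecutive counter values mod n_procs are distinct since eps+1 <= n_procs
        mapping[t] = [next(proc) % n_procs for _ in range(eps + 1)]
    return mapping

m = replicate(["a", "b", "c"], n_procs=4, eps=1)
```

The heuristic's actual contribution lies in where those copies go, choosing placements that keep the latency and the replication-induced communications low, which the round-robin sketch deliberately ignores.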
Comparison of Enhancing Methods for Primary/Backup Approach Meant for Fault Tolerant Scheduling
This report explores algorithms aiming at reducing the algorithm run-time and the rejection rate when scheduling tasks online on real-time embedded systems consisting of several processors prone to faults. The authors introduce a new processor scheduling policy and propose new enhancing methods for the primary/backup approach, and analyse their performance. The studied techniques are as follows: (i) the method of restricted scheduling windows within which the primary and backup copies can be scheduled, (ii) the method of limiting the number of comparisons carried out when scheduling a task on a system, which accounts for the algorithm run-time, and (iii) the method of several scheduling attempts. Last but not least, we inject faults to evaluate their impact on the scheduling algorithms. Thorough experiments show that the best proposed method combines the limitation on the number of comparisons with two scheduling attempts. Compared to the primary/backup approach without this method, the algorithm run-time is reduced by 23% (mean value) and 67% (maximum value) and the rejection rate is decreased by 4%. This improvement in the algorithm run-time is significant, especially for embedded systems dealing with hard real-time tasks. Finally, we found that the studied algorithm performs well in a harsh environment.
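The primary/backup approach with a cap on comparisons can be sketched as follows. This is a hedged toy model, not the report's algorithm: unit time slots, one primary placed as early as possible and one backup copy on a different processor as late as possible, with at most max_cmp primary placements examined before the task is rejected.

```python
# Toy primary/backup placement with a comparison budget (not the report's
# exact method): the slot model, counting convention and parameters are
# hypothetical simplifications.

def try_schedule(busy, n_procs, release, wcet, deadline, max_cmp):
    """busy: set of (proc, slot). Returns ((p, start), (q, bstart)) or None."""
    cmp_done = 0
    for p in range(n_procs):
        for start in range(release, deadline - wcet + 1):
            cmp_done += 1
            if cmp_done > max_cmp:
                return None                    # run-time cap hit -> reject task
            slots = {(p, s) for s in range(start, start + wcet)}
            if slots & busy:
                continue                       # primary window occupied
            # backup: different processor, scheduled as late as possible
            for q in range(n_procs):
                if q == p:
                    continue
                for bstart in range(deadline - wcet, release - 1, -1):
                    bslots = {(q, s) for s in range(bstart, bstart + wcet)}
                    if not (bslots & busy):
                        return (p, start), (q, bstart)
    return None
```

Lowering max_cmp trades rejection rate for run-time, which is exactly the balance the studied enhancing methods tune.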
Comparison of Different Methods Making Use of Backup Copies for Fault-Tolerant Scheduling on Embedded Multiprocessor Systems
As transistors scale down, systems become more vulnerable to faults. Their reliability consequently becomes the main concern, especially in safety-critical applications such as the automotive sector, aeronautics or nuclear plants. Many methods have already been introduced to conceive fault-tolerant systems and therefore improve reliability. Nevertheless, several of them are not suitable for real-time embedded systems since they incur significant overheads; other methods may be less intrusive, but at the cost of being too specific to a dedicated system. The aim of this paper is to analyse a method making use of two task copies when scheduling tasks online on multiprocessor systems. This method can guarantee system reliability without causing too much overhead or requiring any special hardware components. In addition, it remains general and thus applicable to a large number of systems. Last but not least, this paper studies two processor allocation policies: the exhaustive search and the first-found-solution search. It is shown that the exhaustive search is not necessary for efficient fault-tolerant scheduling and that the latter search significantly reduces the computation complexity, which is of interest for embedded systems.
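The two allocation policies compared above can be contrasted in a few lines. This sketch uses a toy load model (a processor is feasible while its summed utilisation stays at or below 1.0); the functions and numbers are hypothetical, not the paper's implementation.

```python
# Toy contrast of the two processor allocation policies (hypothetical load
# model: feasible while total utilisation <= 1.0).

def exhaustive(loads, u):
    """Scan all processors; pick the least loaded one that can take u."""
    best = min(range(len(loads)), key=lambda p: loads[p])
    return best if loads[best] + u <= 1.0 else None

def first_found(loads, u):
    """Stop at the first processor that can take u."""
    for p, load in enumerate(loads):
        if load + u <= 1.0:
            return p
    return None

loads = [0.9, 0.4, 0.2]
assert first_found(loads, 0.3) == 1   # stops at processor 1
assert exhaustive(loads, 0.3) == 2    # scans all, picks processor 2
```

Both return a feasible processor; first-found simply spends less time deciding, which matches the paper's conclusion that the exhaustive scan is not necessary for efficient fault-tolerant scheduling.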
Contribution to Dynamic Fault-Tolerant Task Scheduling for Real-Time Multiprocessor Embedded Systems
This thesis is concerned with online mapping and scheduling of tasks on multiprocessor embedded systems in order to improve reliability subject to various constraints regarding e.g. time or energy. To evaluate system performance, the number of rejected tasks, the algorithm complexity and the resilience, assessed by injecting faults, are analysed. The research was applied to: (i) the primary/backup approach, a fault-tolerance technique based on two task copies, and (ii) scheduling algorithms for small satellites called CubeSats. The chief objective for the primary/backup approach is to analyse processor allocation strategies, devise novel enhancing scheduling methods and choose one which significantly reduces the algorithm run-time without worsening system performance. Regarding CubeSats, the proposed idea is to gather all processors built into a satellite on one board and design scheduling algorithms to make CubeSats more robust to faults. Two real CubeSat scenarios are analysed, and it is found that it is useless to consider systems with more than six processors and that the presented algorithms perform well in a harsh environment and under energy constraints.
An Efficient Uplink Multi-Connectivity Scheme for 5G mmWave Control Plane Applications
The millimeter wave (mmWave) frequencies offer the potential of orders of
magnitude increases in capacity for next-generation cellular systems. However,
links in mmWave networks are susceptible to blockage and may suffer from rapid
variations in quality. Connectivity to multiple cells - at mmWave and/or
traditional frequencies - is considered essential for robust communication. One
of the challenges in supporting multi-connectivity in mmWaves is the
requirement for the network to track the direction of each link in addition to
its power and timing. To address this challenge, we implement a novel uplink
measurement system that, with the joint help of a local coordinator operating
in the legacy band, guarantees continuous monitoring of the channel propagation
conditions and allows for the design of efficient control plane applications,
including handover, beam tracking and initial access. We show that an
uplink-based multi-connectivity approach enables less energy-consuming, better
performing, faster and more stable cell selection and scheduling decisions with
respect to a traditional downlink-based standalone scheme. Moreover, we argue
that the presented framework guarantees (i) efficient tracking of the user in
the presence of the channel dynamics expected at mmWaves, and (ii) fast
reaction to situations in which the primary propagation path is blocked or not
available.
Comment: Submitted for publication in IEEE Transactions on Wireless
Communications (TWC).
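The cell-selection logic such an uplink measurement system enables can be sketched as follows. This is a hypothetical illustration, not the paper's scheme: each cell reports the SINR it measured on the user's uplink signal to the legacy-band coordinator, and the thresholds, report format and cell names are all assumptions.

```python
# Hypothetical uplink-report-driven cell selection (not the paper's scheme):
# switch only on a clear gain (hysteresis), but re-steer immediately if the
# serving path looks blocked. Thresholds in dB are made-up values.

def select_cell(reports, current, hysteresis_db=3.0, blockage_db=-5.0):
    """reports: dict cell_id -> uplink SINR in dB. Returns the serving cell."""
    best = max(reports, key=reports.get)
    if reports.get(current, blockage_db - 1) <= blockage_db:
        return best                    # serving path blocked: fast re-steer
    if reports[best] > reports[current] + hysteresis_db:
        return best                    # switch only on a clear improvement
    return current                     # otherwise stay (stability)

serving = select_cell({"mmw1": 12.0, "mmw2": 20.0, "lte1": 8.0}, current="mmw1")
```

The hysteresis term captures the claimed stability of uplink-based decisions, while the blockage branch captures the fast reaction when the primary propagation path disappears.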
Towards a Fault-tolerant, Scheduling Methodology for Safety-critical Certified Information Systems
Today, many critical information systems have safety-critical and non-safety-critical functions executed on the same platform in order to reduce design and implementation costs. The safety-critical functionality is subject to certification requirements, while the rest of the functionality does not need to be certified, or is certified to a lower level. The resulting mixed-criticality systems bring challenges in design, especially when the critical tasks are required to complete within a timing constraint. This paper studies the problem of scheduling a mixed-criticality system with fault tolerance. A fault-recovery technique called checkpointing is used, whereby a program can go back to a recent checkpoint for re-execution when errors occur. A novel schedulability test is derived to ensure that the safety-critical tasks complete before their deadlines, and its theoretical correctness is shown.
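The trade-off underlying such a checkpointing-based schedulability test can be shown with a standard back-of-the-envelope model (an assumption here, not the paper's exact test): k checkpoints add k*o of overhead but cap each of f re-executions at one segment of length C/(k+1), plus one overhead o per recovery, giving a worst case W(k) = C + k*o + f*(C/(k+1) + o).

```python
# Hedged illustration of the checkpointing trade-off (a textbook model, not
# the paper's schedulability test): all parameters below are hypothetical.
#   C: fault-free execution time, o: checkpoint/recovery overhead,
#   k: number of checkpoints, f: faults tolerated.

def worst_case(C, o, k, f):
    """Worst-case response time with k checkpoints under f faults."""
    return C + k * o + f * (C / (k + 1) + o)

def best_k(C, o, f, k_max=50):
    """Checkpoint count minimising the worst-case response time."""
    return min(range(k_max + 1), key=lambda k: worst_case(C, o, k, f))

k = best_k(C=10.0, o=0.5, f=2)
# a task would then be deemed schedulable if worst_case(C, o, k, f) fits
# before its deadline, which is the shape of test the paper derives
```

More checkpoints shrink the re-execution penalty but inflate the fault-free cost, so the test has to be evaluated at a well-chosen k.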