
    Fault Tolerant Scheduling of Precedence Task Graphs on Heterogeneous Platforms

    Fault tolerance and latency are important requirements in several applications which are time-critical in nature: such applications require guarantees in terms of latency, even when processors are subject to failures. In this paper, we propose a fault-tolerant scheduling heuristic for mapping precedence task graphs on heterogeneous systems. Our approach is based on an active replication scheme, capable of supporting Δ arbitrary fail-silent (fail-stop) processor failures, hence valid results will be provided even if Δ processors fail. We focus on a bi-criteria approach, where we aim at minimizing the latency given a fixed number of failures supported in the system, or the other way round. Major achievements include a low complexity and a drastic reduction of the number of additional communications induced by the replication mechanism. Experimental results demonstrate that our heuristics, despite their lower complexity, outperform their direct competitor, the FTBAR scheduling algorithm [8].

    A Fault Tolerant Scheduling Algorithm for DAG Applications in Cluster Environments.

    Fault tolerance is an essential requirement for systems running applications that must continue execution even when some system components fail. In this paper, a fault-tolerant task scheduling algorithm is proposed for mapping task graphs to heterogeneous processing nodes in cluster computing systems. The starting point of the algorithm is a DAG representing an application, together with information about its tasks: the execution times of the tasks on the target system's processors, the communication times between tasks with data dependencies, and the number of processor failures (Δ) that the scheduling algorithm should tolerate. The algorithm is based on the active replication scheme and schedules Δ+1 replicas of each task to achieve the required fault tolerance. Simulation results show the efficiency of the proposed algorithm in spite of its lower complexity.
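
    Both entries above hinge on the same mechanism: place Δ+1 replicas of every task on distinct processors, so that any Δ fail-stop failures leave at least one live copy of each task. The following Python sketch illustrates that idea under assumptions of our own; the greedy earliest-finish placement, the omission of communication delays, and names such as schedule_with_replication and exec_time are illustrative, not the papers' actual heuristics.

        from collections import defaultdict

        def schedule_with_replication(tasks, procs, exec_time, eps):
            """tasks: task ids in topological order; procs: processor ids;
            exec_time[t][p]: running time of task t on processor p;
            eps: number of fail-stop processor failures to tolerate."""
            assert len(procs) > eps, "need more processors than failures"
            ready = defaultdict(float)    # earliest idle time per processor
            placement = {}
            for t in tasks:
                # Greedily pick the eps+1 processors on which this task's
                # replicas would finish first, one replica per processor.
                best = sorted(procs, key=lambda p: ready[p] + exec_time[t][p])
                replicas = []
                for p in best[:eps + 1]:
                    start = ready[p]
                    ready[p] = start + exec_time[t][p]
                    replicas.append((p, start, ready[p]))
                placement[t] = replicas   # survives any eps failures
            return placement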

    Self-adaptivity of applications on network on chip multiprocessors: the case of fault-tolerant Kahn process networks

    Technology scaling, accompanied by higher operating frequencies and the ability to integrate more functionality in the same chip, has been the driving force behind delivering higher-performance computing systems at lower costs. Embedded computing systems, which have been riding the same wave of success, have evolved into complex architectures encompassing a high number of cores interconnected by an on-chip network (usually identified as a Multiprocessor System-on-Chip). However, these trends are hindered by issues that arise as technology scaling continues towards deep submicron scales. Firstly, the growing complexity of these systems and the variability introduced by process technologies make it ever harder to perform a thorough optimization of the system at design time. Secondly, designers are faced with a reliability wall that emerges as age-related degradation reduces the lifetime of transistors, and as the probability of defects escaping post-manufacturing testing increases. In this thesis, we take on these challenges within the context of streaming applications running in network-on-chip based parallel (not necessarily homogeneous) systems-on-chip that adopt the no-remote memory access model. In particular, this thesis tackles two main problems: (1) fault-aware online task remapping and (2) application-level self-adaptation for quality management. For the former, by viewing fault tolerance as a self-adaptation aspect, we adopt a cross-layer approach that aims at graceful performance degradation by addressing permanent faults in processing elements mostly at system level, in particular by exploiting redundancy available in multi-core platforms. We propose an optimal solution based on an integer linear programming formulation (suitable for design-time adoption) as well as heuristic-based solutions to be used at run-time. We assess the impact of our approach on lifetime reliability. We propose two recovery schemes based on a checkpoint-and-rollback and a rollforward technique. For the latter, we propose two variants of a monitor-controller-adapter loop that adapts application-level parameters to meet performance goals. We demonstrate not only that fault tolerance and self-adaptivity can be achieved in embedded platforms, but also that it can be done without incurring large overheads. In addressing these problems, we present techniques which have been realized (depending on their characteristics) in the form of a design tool, a run-time library, or a hardware core to be added to the basic architecture.
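
    Of the two recovery schemes mentioned, checkpoint-and-rollback is the simpler to picture in code. Below is a minimal sketch under assumptions of our own (a single process with in-memory snapshots, and a hypothetical TransientFault exception standing in for the platform's fault detection); the thesis's actual schemes for Kahn process networks must also preserve channel state and are considerably richer.

        import copy

        class TransientFault(Exception):
            """Raised by the step function when a transient error is detected."""

        def run_with_rollback(state, total_steps, step_fn, checkpoint_every=10):
            """Apply step_fn total_steps times, restarting from the last
            checkpoint whenever a TransientFault is detected. Assumes
            faults really are transient, so retries eventually succeed."""
            checkpoint = copy.deepcopy(state)
            done_at_checkpoint = done = 0
            while done < total_steps:
                try:
                    state = step_fn(state)
                    done += 1
                except TransientFault:
                    # Roll back: discard work done since the last snapshot.
                    state = copy.deepcopy(checkpoint)
                    done = done_at_checkpoint
                    continue
                if done % checkpoint_every == 0:
                    # Commit progress so a later fault costs at most one window.
                    checkpoint = copy.deepcopy(state)
                    done_at_checkpoint = done
            return state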

    On the Design of Real-Time Systems on Multi-Core Platforms under Uncertainty

    Real-time systems are computing systems that demand assurance of not only the logical correctness of computational results but also the timing of these results. To ensure timing constraints, traditional real-time system designs usually adopt a worst-case based deterministic approach. However, such an approach is becoming out of sync with the continuous evolution of IC technology and the increased complexity of real-time applications. As IC technology continues to evolve into the deep sub-micron domain, process variation causes processor performance to vary from die to die, chip to chip, and even core to core. The extensive resource sharing on multi-core platforms also significantly increases the uncertainty when executing real-time tasks. The traditional approach can only lead to extremely pessimistic, and thus impractical, designs of real-time systems. Our research seeks to address the uncertainty problem when designing real-time systems on multi-core platforms. We first attacked the uncertainty problem caused by process variation. We proposed a virtualization framework and developed techniques to optimize the system's performance under process variation. We further studied the problem of peak temperature minimization for real-time applications on multi-core platforms. Three heuristics were developed to reduce the peak temperature for real-time systems. Next, we sought to address the uncertainty in real-time task execution times by developing statistical real-time scheduling techniques. We studied the problem of fixed-priority real-time scheduling of implicit-deadline periodic tasks with probabilistic execution times on multi-core platforms. We further extended our research to tasks with explicit deadlines, introducing the concept of harmonicity to this more general setting and developing new task partitioning techniques. Throughout our research, we have conducted extensive simulations to study the effectiveness and efficiency of the developed techniques. The increasing process variation and the ever-increasing scale and complexity of real-time systems both demand a paradigm shift in the design of real-time applications. Effectively dealing with uncertainty in the design of real-time applications is a challenging but critical problem. Our research is an effort in this endeavor, and we conclude this dissertation with discussions of potential future work.
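
    The statistical scheduling techniques summarized above treat execution times as random variables. A standard building block for this kind of analysis, shown here only as an illustration and not as the dissertation's own method, is convolving discrete execution-time distributions to obtain the probability that the total demand of a set of independent tasks exceeds a deadline.

        from collections import defaultdict

        def convolve(dist_a, dist_b):
            """Convolve two {execution_time: probability} distributions."""
            out = defaultdict(float)
            for ta, pa in dist_a.items():
                for tb, pb in dist_b.items():
                    out[ta + tb] += pa * pb
            return dict(out)

        def deadline_miss_probability(task_dists, deadline):
            """Probability that the summed execution time of independent
            tasks (each a discrete distribution) exceeds the deadline."""
            total = {0: 1.0}
            for dist in task_dists:
                total = convolve(total, dist)
            return sum(p for t, p in total.items() if t > deadline)

        # Two tasks, each taking 2 time units with probability 0.9 and
        # 5 with probability 0.1, against a deadline of 7:
        print(deadline_miss_probability([{2: 0.9, 5: 0.1},
                                         {2: 0.9, 5: 0.1}], 7))
        # ~= 0.01: only the 5 + 5 = 10 outcome misses the deadline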

    FANTOM: Fault Tolerant Task-Drop Aware Scheduling for Mixed-Criticality Systems

    Mixed-Criticality (MC) systems have emerged as an effective solution in various industries, where multiple tasks with different real-time and safety requirements (different levels of criticality) are integrated onto a common hardware platform. In these systems, a fault may occur for different reasons, e.g., hardware defects, software errors, or the arrival of unexpected events. To tolerate faults in MC systems, the re-execution technique is typically employed; this may lead to overruns of high-criticality tasks (HCTs), which in turn necessitate dropping low-criticality tasks (LCTs) or degrading their quality. However, frequent drops or relatively long execution times of LCTs (especially mission-critical ones) are not always acceptable and may negatively impact the performance or functionality of MC systems. In this regard, this article proposes a realistic MC task model and develops a design-time, task-drop aware schedulability analysis based on the Earliest Deadline First with Virtual Deadline (EDF-VD) algorithm. Under this analysis and the proposed scheduling policy for the new MC task model, when an HCT overruns and the system switches to the high-criticality (HI) mode, the number of drops per LCT is prevented from exceeding a predefined threshold. In addition, to guarantee the real-time constraints and safety requirements of MC tasks in the presence of faults (assumed transient in this article), a corresponding scheduling mechanism has been developed. According to results obtained from an extensive set of simulations, validated through a realistic avionics application, the proposed method improves the acceptance ratio by up to 43.9% compared to the state of the art.
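
    For context, the classic uniprocessor EDF-VD test that the article builds on chooses a scaling factor x and shortens each HCT's deadline to x·D in low-criticality mode, preserving slack for a mode switch. The sketch below implements only that baseline test, with made-up utilization values in the example; FANTOM's contribution layers fault recovery and the per-LCT drop thresholds on top of such an analysis.

        def edf_vd_scaling_factor(u_lo_lo, u_hi_lo, u_hi_hi):
            """u_lo_lo: total LO-mode utilization of low-criticality tasks;
            u_hi_lo, u_hi_hi: LO- and HI-mode utilizations of HCTs.
            Returns a scaling factor x in (0, 1] if the set passes the
            standard EDF-VD test, or None if it fails."""
            assert u_lo_lo < 1.0
            if u_lo_lo + u_hi_hi <= 1.0:
                return 1.0                    # plain EDF already suffices
            x = u_hi_lo / (1.0 - u_lo_lo)     # LO-mode feasibility bound
            if 0.0 < x <= 1.0 and x * u_lo_lo + u_hi_hi <= 1.0:
                return x                      # HI-mode condition also holds
            return None                       # not EDF-VD schedulable

        # Example: LCTs use 0.3; HCTs use 0.2 in LO mode, 0.8 in HI mode.
        print(edf_vd_scaling_factor(0.3, 0.2, 0.8))   # ~= 0.286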

    Adaptive Computing Systems for Aerospace

    Today's computer systems are growing more and more complex at a pace that requires the development of novel and more effective methodologies to automate their design. Space, in particular, represents a challenging environment: without protection from ionizing and particle radiation, CMOS-based electronics are subject to transient faults, performance degradation, accelerated wear, and, ultimately, system failure. Traditional approaches adopted to guarantee reliability and extended lifetime are based on redundancy that is established at design time. These solutions are expensive and sometimes inefficient, as they increase the complexity and size of a system, exposing it to higher risks of overheating and radiation-induced errors. Moreover, critical systems (e.g., time-constrained ones and those where access is limited) must be able to cope with pivotal situations without relying on human intervention. Hence the emerging interest in computer systems with adaptive capabilities as the most suitable solution for novel high-performance embedded devices for aerospace. 
    Self-adaptive computing carries unmatched potential and great promise for the creation of a new generation of smart, more reliable computers, and it addresses the challenge of designing and programming modern and future computer systems that must meet conflicting goals. Drawing from the fields of artificial intelligence and reconfigurable systems, we aim at developing self-adaptive computer systems for aerospace. Our goal is to improve their efficiency, fault tolerance, and computational capabilities. The first step in this research is the experimental analysis of the most popular multi-objective design-space exploration algorithms for high-level design. These algorithms were collected from the recent literature and include heuristic, evolutionary, and statistical methods. Their comparison provides insights that we use to define guidelines for the choice of the most appropriate optimization algorithms, given the features of the design space. For the creation of a self-managing optimization framework, one enabling the adaptive trade-off of multiple objectives, we leverage the tools of probabilistic graphical models. We introduce a mechanism based on dynamic hidden Markov models that balances the availability and lifetime of multiprocessor systems. This is achieved by estimating the occurrence of permanent faults amid transient faults, and by dynamically migrating the computation onto spare resources when failures occur. The dynamic nature of the model makes it adjustable to different mission profiles and fault rates. The results show that we are able to lead systems to extended lifetimes while keeping their availability close to ideal. On account of the stringent timing constraints imposed by aerospace systems, we then investigate the optimization of fault tolerance under real-time requirements. We propose a methodology to improve the reliability of computation in the presence of transient errors when mapping real-time tasks onto a homogeneous multiprocessor system with voltage and frequency scaling capabilities. In this framework, we take advantage of probability theory to define a novel trade-off between power consumption and fault tolerance. As we recognize that resilience is a pervasive property of interest (e.g., for the design and analysis of generic complex systems), we adapt a formal definition of it to a probabilistic framework, again derived from hidden Markov models. This allows us to realistically model the stochastic evolution and partial observability of complex real-world environments. Within this framework, we propose an efficient algorithm for the exact computation of the essential inference step required for generic property checking. To demonstrate the flexibility of this approach, we validate it in the context, among others, of a self-aware, reconfigurable computing system for aerospace. Finally, we move the scope of our research towards robotics and multi-agent systems, two topics of growing popularity in space exploration. We tackle the problem of connectivity assessment and maintenance in the distributed and self-adaptive context of swarm robotics. We review the limitations of existing solutions and propose a novel methodology to create connected complex geometries for multiple task coverage. Additional contributions in the areas of (i) CubeSat design, (ii) the modelling of space radiation for FPGA fault injection, and (iii) probabilistic timing analysis for real-time systems are summarized in the appendices. 
    In the author's opinion, this research provides a number of useful stepping stones for the creation of a new generation of computing systems that autonomously and reliably perform their tasks for longer periods of time, fostering simpler and cheaper space exploration.
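
    The estimation of permanent faults amid transient ones, central to the mechanism described above, can be pictured as a two-state hidden Markov model filtered with the forward algorithm. The sketch below uses invented probabilities and a deliberately tiny state space (healthy vs. permanently faulty, with a binary per-window error flag as the observation); the thesis's dynamic HMMs are considerably richer.

        def forward_filter(observations,
                           p_fail=0.001,        # healthy -> permanent, per window
                           p_err_healthy=0.05,  # transient error rate when healthy
                           p_err_faulty=0.9):   # error rate once permanently faulty
            """observations: iterable of 0/1 error flags, one per window.
            Returns P(permanent fault | observations so far) after each step."""
            trans = [[1 - p_fail, p_fail],      # from healthy
                     [0.0, 1.0]]                # permanent faults do not heal
            emit = [[1 - p_err_healthy, p_err_healthy],
                    [1 - p_err_faulty, p_err_faulty]]
            belief = [1.0, 0.0]                 # start out assumed healthy
            history = []
            for obs in observations:
                # Predict: propagate the belief through the transition matrix.
                pred = [sum(belief[i] * trans[i][j] for i in range(2))
                        for j in range(2)]
                # Update: weight by the observation likelihood, renormalize.
                post = [pred[j] * emit[j][obs] for j in range(2)]
                norm = sum(post)
                belief = [p / norm for p in post]
                history.append(belief[1])
            return history

        # A burst of errors drives the permanent-fault belief towards 1:
        print(forward_filter([0, 0, 1, 1, 1, 1])[-1])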

    Design Space Exploration and Resource Management of Multi/Many-Core Systems

    The increasing demand for processing a higher number of applications and related data on computing platforms has resulted in reliance on multi-/many-core chips, as they facilitate parallel processing. However, these platforms also need to be energy-efficient, reliable, and capable of secure computation in the interest of the whole community. This book provides perspectives on the aforementioned aspects from leading researchers, in terms of state-of-the-art contributions and upcoming trends.

    Timing Predictability in Future Multi-Core Avionics Systems
