
    Fault Tolerant Scheduling of Precedence Task Graphs on Heterogeneous Platforms

    Fault tolerance and latency are important requirements in several applications which are time-critical in nature: such applications require guarantees in terms of latency, even when processors are subject to failures. In this paper, we propose a fault-tolerant scheduling heuristic for mapping precedence task graphs on heterogeneous systems. Our approach is based on an active replication scheme, capable of supporting Δ arbitrary fail-silent (fail-stop) processor failures, hence valid results will be provided even if Δ processors fail. We focus on a bi-criteria approach, where we aim at minimizing the latency given a fixed number of failures supported in the system, or the other way round. Major achievements include a low complexity and a drastic reduction of the number of additional communications induced by the replication mechanism. Experimental results demonstrate that our heuristics, despite their lower complexity, outperform their direct competitor, the FTBAR scheduling algorithm [8].

    A Fault Tolerant Scheduling Algorithm for DAG Applications in Cluster Environments.

    Fault tolerance is an essential requirement for systems running applications that must continue execution even when some system components fail. In this paper, a fault-tolerant task scheduling algorithm is proposed for mapping task graphs to heterogeneous processing nodes in cluster computing systems. The starting point of the algorithm is a DAG representing an application, together with information about its tasks: the execution times of the tasks on the target system's processors, the communication times between tasks with data dependencies, and the number of processor failures (Δ) that the scheduling algorithm should tolerate. The algorithm is based on the active replication scheme and schedules Δ+1 replicas of each task to achieve the required fault tolerance. Simulation results show the efficiency of the proposed algorithm in spite of its lower complexity.
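
    Both entries above hinge on the same mechanism: place Δ+1 replicas of every task on distinct processors, so that any Δ fail-stop failures leave at least one live copy of each task. The following Python sketch illustrates that idea under assumptions of our own; the greedy earliest-finish placement, the omission of communication delays, and names such as schedule_with_replication and exec_time are illustrative, not the papers' actual heuristics.

        from collections import defaultdict

        def schedule_with_replication(tasks, procs, exec_time, eps):
            """tasks: task ids in topological order; procs: processor ids;
            exec_time[t][p]: running time of task t on processor p;
            eps: number of fail-stop processor failures to tolerate."""
            assert len(procs) > eps, "need more processors than failures"
            ready = defaultdict(float)    # earliest idle time per processor
            placement = {}
            for t in tasks:
                # Greedily pick the eps+1 processors on which this task's
                # replicas would finish first, one replica per processor.
                best = sorted(procs, key=lambda p: ready[p] + exec_time[t][p])
                replicas = []
                for p in best[:eps + 1]:
                    start = ready[p]
                    ready[p] = start + exec_time[t][p]
                    replicas.append((p, start, ready[p]))
                placement[t] = replicas   # survives any eps failures
            return placement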

    Self-adaptivity of applications on network on chip multiprocessors: the case of fault-tolerant Kahn process networks

    Technology scaling, accompanied by higher operating frequencies and the ability to integrate more functionality in the same chip, has been the driving force behind delivering higher-performance computing systems at lower costs. Embedded computing systems, which have been riding the same wave of success, have evolved into complex architectures encompassing a high number of cores interconnected by an on-chip network (usually identified as a Multiprocessor System-on-Chip). However, these trends are hindered by issues that arise as technology scaling continues towards deep submicron scales. Firstly, the growing complexity of these systems and the variability introduced by process technologies make it ever harder to perform a thorough optimization of the system at design time. Secondly, designers are faced with a reliability wall that emerges as age-related degradation reduces the lifetime of transistors, and as the probability of defects escaping post-manufacturing testing increases. In this thesis, we take on these challenges within the context of streaming applications running in network-on-chip based parallel (not necessarily homogeneous) systems-on-chip that adopt the no-remote memory access model. In particular, this thesis tackles two main problems: (1) fault-aware online task remapping and (2) application-level self-adaptation for quality management. For the former, by viewing fault tolerance as a self-adaptation aspect, we adopt a cross-layer approach that aims at graceful performance degradation by addressing permanent faults in processing elements mostly at system level, in particular by exploiting redundancy available in multi-core platforms. We propose an optimal solution based on an integer linear programming formulation (suitable for design-time adoption) as well as heuristic-based solutions to be used at run-time. We assess the impact of our approach on lifetime reliability. We propose two recovery schemes based on a checkpoint-and-rollback and a rollforward technique. For the latter, we propose two variants of a monitor-controller-adapter loop that adapts application-level parameters to meet performance goals. We demonstrate not only that fault tolerance and self-adaptivity can be achieved in embedded platforms, but also that it can be done without incurring large overheads. In addressing these problems, we present techniques which have been realized (depending on their characteristics) in the form of a design tool, a run-time library, or a hardware core to be added to the basic architecture.
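
    Of the two recovery schemes mentioned, checkpoint-and-rollback is the simpler to picture in code. Below is a minimal sketch under assumptions of our own (a single process with in-memory snapshots, and a hypothetical TransientFault exception standing in for the platform's fault detection); the thesis's actual schemes for Kahn process networks must also preserve channel state and are considerably richer.

        import copy

        class TransientFault(Exception):
            """Raised by the step function when a transient error is detected."""

        def run_with_rollback(state, total_steps, step_fn, checkpoint_every=10):
            """Apply step_fn total_steps times, restarting from the last
            checkpoint whenever a TransientFault is detected. Assumes
            faults really are transient, so retries eventually succeed."""
            checkpoint = copy.deepcopy(state)
            done_at_checkpoint = done = 0
            while done < total_steps:
                try:
                    state = step_fn(state)
                    done += 1
                except TransientFault:
                    # Roll back: discard work done since the last snapshot.
                    state = copy.deepcopy(checkpoint)
                    done = done_at_checkpoint
                    continue
                if done % checkpoint_every == 0:
                    # Commit progress so a later fault costs at most one window.
                    checkpoint = copy.deepcopy(state)
                    done_at_checkpoint = done
            return state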

    On the Design of Real-Time Systems on Multi-Core Platforms under Uncertainty

    Real-time systems are computing systems that demand assurance of not only the logical correctness of computational results but also the timing of these results. To ensure timing constraints, traditional real-time system designs usually adopt a worst-case based deterministic approach. However, such an approach is becoming out of sync with the continuous evolution of IC technology and the increased complexity of real-time applications. As IC technology continues to evolve into the deep sub-micron domain, process variation causes processor performance to vary from die to die, chip to chip, and even core to core. The extensive resource sharing on multi-core platforms also significantly increases the uncertainty when executing real-time tasks. The traditional approach can only lead to extremely pessimistic, and thus impractical, designs of real-time systems. Our research seeks to address the uncertainty problem when designing real-time systems on multi-core platforms. We first attacked the uncertainty problem caused by process variation. We proposed a virtualization framework and developed techniques to optimize the system's performance under process variation. We further studied the problem of peak temperature minimization for real-time applications on multi-core platforms. Three heuristics were developed to reduce the peak temperature for real-time systems. Next, we sought to address the uncertainty in real-time task execution times by developing statistical real-time scheduling techniques. We studied the problem of fixed-priority real-time scheduling of implicit-deadline periodic tasks with probabilistic execution times on multi-core platforms. We further extended our research to tasks with explicit deadlines, introducing the concept of harmonicity to this more general setting and developing new task partitioning techniques. Throughout our research, we have conducted extensive simulations to study the effectiveness and efficiency of the developed techniques. The increasing process variation and the ever-increasing scale and complexity of real-time systems both demand a paradigm shift in the design of real-time applications. Effectively dealing with uncertainty in the design of real-time applications is a challenging but critical problem. Our research is an effort in this endeavor, and we conclude this dissertation with discussions of potential future work.
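
    The statistical scheduling techniques summarized above treat execution times as random variables. A standard building block for this kind of analysis, shown here only as an illustration and not as the dissertation's own method, is convolving discrete execution-time distributions to obtain the probability that the total demand of a set of independent tasks exceeds a deadline.

        from collections import defaultdict

        def convolve(dist_a, dist_b):
            """Convolve two {execution_time: probability} distributions."""
            out = defaultdict(float)
            for ta, pa in dist_a.items():
                for tb, pb in dist_b.items():
                    out[ta + tb] += pa * pb
            return dict(out)

        def deadline_miss_probability(task_dists, deadline):
            """Probability that the summed execution time of independent
            tasks (each a discrete distribution) exceeds the deadline."""
            total = {0: 1.0}
            for dist in task_dists:
                total = convolve(total, dist)
            return sum(p for t, p in total.items() if t > deadline)

        # Two tasks, each taking 2 time units with probability 0.9 and
        # 5 with probability 0.1, against a deadline of 7:
        print(deadline_miss_probability([{2: 0.9, 5: 0.1},
                                         {2: 0.9, 5: 0.1}], 7))
        # ~= 0.01: only the 5 + 5 = 10 outcome misses the deadline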

    FANTOM: Fault Tolerant Task-Drop Aware Scheduling for Mixed-Criticality Systems

    Mixed-Criticality (MC) systems have emerged as an effective solution in various industries, where multiple tasks with different real-time and safety requirements (different levels of criticality) are integrated onto a common hardware platform. In these systems, a fault may occur for different reasons, e.g., hardware defects, software errors, or the arrival of unexpected events. To tolerate faults in MC systems, the re-execution technique is typically employed; this may lead to overruns of high-criticality tasks (HCTs), which in turn necessitate dropping low-criticality tasks (LCTs) or degrading their quality. However, frequent drops or relatively long execution times of LCTs (especially mission-critical ones) are not always acceptable and may negatively impact the performance or functionality of MC systems. In this regard, this article proposes a realistic MC task model and develops a design-time, task-drop aware schedulability analysis based on the Earliest Deadline First with Virtual Deadline (EDF-VD) algorithm. Under this analysis and the proposed scheduling policy for the new MC task model, when an HCT overruns and the system switches to the high-criticality (HI) mode, the number of drops per LCT is prevented from exceeding a predefined threshold. In addition, to guarantee the real-time constraints and safety requirements of MC tasks in the presence of faults (assumed transient in this article), a corresponding scheduling mechanism has been developed. According to results obtained from an extensive set of simulations, validated through a realistic avionics application, the proposed method improves the acceptance ratio by up to 43.9% compared to the state of the art.
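
    For context, the classic uniprocessor EDF-VD test that the article builds on chooses a scaling factor x and shortens each HCT's deadline to x·D in low-criticality mode, preserving slack for a mode switch. The sketch below implements only that baseline test, with made-up utilization values in the example; FANTOM's contribution layers fault recovery and the per-LCT drop thresholds on top of such an analysis.

        def edf_vd_scaling_factor(u_lo_lo, u_hi_lo, u_hi_hi):
            """u_lo_lo: total LO-mode utilization of low-criticality tasks;
            u_hi_lo, u_hi_hi: LO- and HI-mode utilizations of HCTs.
            Returns a scaling factor x in (0, 1] if the set passes the
            standard EDF-VD test, or None if it fails."""
            assert u_lo_lo < 1.0
            if u_lo_lo + u_hi_hi <= 1.0:
                return 1.0                    # plain EDF already suffices
            x = u_hi_lo / (1.0 - u_lo_lo)     # LO-mode feasibility bound
            if 0.0 < x <= 1.0 and x * u_lo_lo + u_hi_hi <= 1.0:
                return x                      # HI-mode condition also holds
            return None                       # not EDF-VD schedulable

        # Example: LCTs use 0.3; HCTs use 0.2 in LO mode, 0.8 in HI mode.
        print(edf_vd_scaling_factor(0.3, 0.2, 0.8))   # ~= 0.286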

    Adaptive Computing Systems for Aerospace

    Today's computer systems are growing more and more complex at a pace that requires the development of novel and more effective methodologies to automate their design. Space, in particular, represents a challenging environment: without protection from ionizing and particle radiation, CMOS-based electronics are subject to transient faults, performance degradation, accelerated wear, and, ultimately, system failure. Traditional approaches adopted to guarantee reliability and extended lifetime are based on redundancy that is established at design time. These solutions are expensive and sometimes inefficient, as they increase the complexity and size of a system, exposing it to higher risks of overheating and radiation-induced errors. Moreover, critical systems (e.g., time-constrained ones and those where access is limited) must be able to cope with pivotal situations without relying on human intervention. Hence the emerging interest in computer systems with adaptive capabilities as the most suitable solution for novel high-performance embedded devices for aerospace. 
    Self-adaptive computing carries unmatched potential and great promise for the creation of a new generation of smart, more reliable computers, and it addresses the challenge of designing and programming modern and future computer systems that must meet conflicting goals. Drawing from the fields of artificial intelligence and reconfigurable systems, we aim at developing self-adaptive computer systems for aerospace. Our goal is to improve their efficiency, fault tolerance, and computational capabilities. The first step in this research is the experimental analysis of the most popular multi-objective design-space exploration algorithms for high-level design. These algorithms were collected from the recent literature and include heuristic, evolutionary, and statistical methods. Their comparison provides insights that we use to define guidelines for the choice of the most appropriate optimization algorithms, given the features of the design space. For the creation of a self-managing optimization framework, one enabling the adaptive trade-off of multiple objectives, we leverage the tools of probabilistic graphical models. We introduce a mechanism based on dynamic hidden Markov models that balances the availability and lifetime of multiprocessor systems. This is achieved by estimating the occurrence of permanent faults amid transient faults, and by dynamically migrating the computation onto spare resources when failures occur. The dynamic nature of the model makes it adjustable to different mission profiles and fault rates. The results show that we are able to lead systems to extended lifetimes while keeping their availability close to ideal. On account of the stringent timing constraints imposed by aerospace systems, we then investigate the optimization of fault tolerance under real-time requirements. We propose a methodology to improve the reliability of computation in the presence of transient errors when mapping real-time tasks onto a homogeneous multiprocessor system with voltage and frequency scaling capabilities. In this framework, we take advantage of probability theory to define a novel trade-off between power consumption and fault tolerance. As we recognize that resilience is a pervasive property of interest (e.g., for the design and analysis of generic complex systems), we adapt a formal definition of it to a probabilistic framework, again derived from hidden Markov models. This allows us to realistically model the stochastic evolution and partial observability of complex real-world environments. Within this framework, we propose an efficient algorithm for the exact computation of the essential inference step required for generic property checking. To demonstrate the flexibility of this approach, we validate it in the context, among others, of a self-aware, reconfigurable computing system for aerospace. Finally, we move the scope of our research towards robotics and multi-agent systems, two topics of growing popularity in space exploration. We tackle the problem of connectivity assessment and maintenance in the distributed and self-adaptive context of swarm robotics. We review the limitations of existing solutions and propose a novel methodology to create connected complex geometries for multiple task coverage. Additional contributions in the areas of (i) CubeSat design, (ii) the modelling of space radiation for FPGA fault injection, and (iii) probabilistic timing analysis for real-time systems are summarized in the appendices. 
    In the author's opinion, this research provides a number of useful stepping stones for the creation of a new generation of computing systems that autonomously and reliably perform their tasks for longer periods of time, fostering simpler and cheaper space exploration.
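
    The estimation of permanent faults amid transient ones, central to the mechanism described above, can be pictured as a two-state hidden Markov model filtered with the forward algorithm. The sketch below uses invented probabilities and a deliberately tiny state space (healthy vs. permanently faulty, with a binary per-window error flag as the observation); the thesis's dynamic HMMs are considerably richer.

        def forward_filter(observations,
                           p_fail=0.001,        # healthy -> permanent, per window
                           p_err_healthy=0.05,  # transient error rate when healthy
                           p_err_faulty=0.9):   # error rate once permanently faulty
            """observations: iterable of 0/1 error flags, one per window.
            Returns P(permanent fault | observations so far) after each step."""
            trans = [[1 - p_fail, p_fail],      # from healthy
                     [0.0, 1.0]]                # permanent faults do not heal
            emit = [[1 - p_err_healthy, p_err_healthy],
                    [1 - p_err_faulty, p_err_faulty]]
            belief = [1.0, 0.0]                 # start out assumed healthy
            history = []
            for obs in observations:
                # Predict: propagate the belief through the transition matrix.
                pred = [sum(belief[i] * trans[i][j] for i in range(2))
                        for j in range(2)]
                # Update: weight by the observation likelihood, renormalize.
                post = [pred[j] * emit[j][obs] for j in range(2)]
                norm = sum(post)
                belief = [p / norm for p in post]
                history.append(belief[1])
            return history

        # A burst of errors drives the permanent-fault belief towards 1:
        print(forward_filter([0, 0, 1, 1, 1, 1])[-1])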

    Design Space Exploration and Resource Management of Multi/Many-Core Systems

    The increasing demand for processing a higher number of applications and related data on computing platforms has resulted in reliance on multi-/many-core chips, as they facilitate parallel processing. However, these platforms also need to be energy-efficient, reliable, and capable of secure computation in the interest of the whole community. This book provides perspectives on the aforementioned aspects from leading researchers, in terms of state-of-the-art contributions and upcoming trends.

    Timing Predictability in Future Multi-Core Avionics Systems
