254 research outputs found
Resilient random modulo cache memories for probabilistically-analyzable real-time systems
Fault tolerance has often been assessed separately in safety-related real-time systems, which may lead to inefficient solutions. Recently, Measurement-Based Probabilistic Timing Analysis (MBPTA) has been proposed to estimate Worst-Case Execution Time (WCET) on high performance hardware. The intrinsic probabilistic nature of MBPTA-commpliant hardware matches perfectly with the random nature of hardware faults.
Joint WCET analysis and reliability assessment has been done so far for some MBPTA-compliant designs, but not for the most promising cache design: random modulo. In this paper we perform, for the first time, an assessment of the aging-robustness of random modulo and propose new implementations preserving the key properties of random modulo, a.k.a. low critical path impact, low miss rates and MBPTA compliance, while enhancing reliability in front of aging by achieving a better – yet random – activity distribution across cache sets.Peer ReviewedPostprint (author's final draft
Probabilistic Worst-Case Timing Analysis: Taxonomy and Comprehensive Survey
"© ACM, 2019. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Computing Surveys, {VOL 52, ISS 1, (February 2019)} https://dl.acm.org/doi/10.1145/3301283"[EN] The unabated increase in the complexity of the hardware and software components of modern embedded real-time systems has given momentum to a host of research in the use of probabilistic and statistical techniques for timing analysis. In the last few years, that front of investigation has yielded a body of scientific literature vast enough to warrant some comprehensive taxonomy of motivations, strategies of application, and directions of research. This survey addresses this very need, singling out the principal techniques in the state of the art of timing analysis that employ probabilistic reasoning at some level, building a taxonomy of them, discussing their relative merit and limitations, and the relations among them. In addition to offering a comprehensive foundation to savvy probabilistic timing analysis, this article also identifies the key challenges to be addressed to consolidate the scientific soundness and industrial viability of this emerging field.This work has also been partially supported by the Spanish Ministry of Science and Innovation under grant TIN2015-65316-P, the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 772773), and the HiPEAC Network of Excellence. Jaume Abella was partially supported by the Ministry of Economy and Competitiveness under a Ramon y Cajal postdoctoral fellowship (RYC-2013-14717). Enrico Mezzetti has been partially supported by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva-Incorporación postdoctoral fellowship No. IJCI-2016-27396.Cazorla, FJ.; Kosmidis, L.; Mezzetti, E.; Hernández Luz, C.; Abella, J.; Vardanega, T. (2019). Probabilistic Worst-Case Timing Analysis: Taxonomy and Comprehensive Survey. ACM Computing Surveys. 52(1):1-35. https://doi.org/10.1145/3301283S13552
The Heptane Static Worst-Case Execution Time Estimation Tool
Estimation of worst-case execution times (WCETs) is required to validate the temporal behavior of hard real time systems. Heptane is an open-source software program that estimates upper bounds of execution times on MIPS and ARM v7 architectures, offered to the WCET estimation community to experiment new WCET estimation techniques. The software architecture of Heptane was designed to be as modular and extensible as possible to facilitate the integration of new approaches. This paper is devoted to a description of Heptane, and includes information on the analyses it implements, how to use it and extend it
A Survey of Probabilistic Timing Analysis Techniques for Real-Time Systems
This survey covers probabilistic timing analysis techniques for real-time systems. It reviews and critiques the key results in the field from its origins in 2000 to the latest research published up to the end of August 2018. The survey provides a taxonomy of the different methods used, and a classification of existing research. A detailed review is provided covering the main subject areas: static probabilistic timing analysis, measurement-based probabilistic timing analysis, and hybrid methods. In addition, research on supporting mechanisms and techniques, case studies, and evaluations is also reviewed. The survey concludes by identifying open issues, key challenges and possible directions for future research
Impact of Transient Faults on Timing Behavior and Mitigation with Near-Zero WCET Overhead
As time-critical systems require timing guarantees, Worst-Case Execution Times (WCET) have to be employed. However, WCET estimation methods usually assume fault-free hardware. If proper actions are not taken, such fault-free WCET approaches become unsafe, when faults impact the hardware during execution. The majority of approaches, dealing with hardware faults, address the impact of faults on the functional behavior of an application, i.e., denial of service and binary correctness. Few approaches address the impact of faults on the application timing behavior, i.e., time to finish the application, and target faults occurring in memories. However, as the transistor size in modern technologies is significantly reduced, faults in cores cannot be considered negligible anymore. This work shows that faults not only affect the functional behavior, but they can have a significant impact on the timing behavior of applications. To expose the overall impact of faults, we enhance vulnerability analysis to include not only functional, but also timing correctness, and show that faults impact WCET estimations. As common techniques to deal with faults, such as watchdog timers and re-execution, have large timing overhead for error detection and correction, we propose a mechanism with near-zero and bounded timing overhead. A RISC-V core is used as a case study. The obtained results show that faults can lead up to almost 700% increase in the maximum observed execution time between fault-free and faulty execution without protection, affecting the WCET estimations. On the contrary, the proposed mechanism is able to restore fault-free WCET estimations with a bounded overhead of 2 execution cycles
Static Probabilistic Timing Analysis for Real-Time Embedded Systems in Presence of Faults
RÉSUMÉ
Une mémoire cache est le lien entre le processeur et la mémoire principale. Elle permet de réduire considérablement les temps d’accès aux blocs de mémoire dans un système embarqué temps-réel et critique (CRTES), ce qui influence énormément son comportement temporel. Des caches à accès aléatoire—caches avec une politique de remplacement aléatoire—ont été proposées dans le but d’améliorer les estimations du comportement temporel des CRTES, et cela en diminuant les cas pathologiques. Les Measurement Based Probabilistic Timing Analysis (MBPTA) et Static Probabilistic Timing Analysis (SPTA) sont deux méthodes
qui ciblent à estimer le pire temps d’exécution (Worst Case Execution Time probabiliste - pWCET) d’une façon probabiliste et sécuritaire pour les caches aléatoires. À travers cette dissertation, on présente des travaux de recherche concernant l’estimation temporelle basée
sur la méthode SPTA. L’état de l’art sur les méthodologies SPTA fournissent des estimations sécuritaires et strictes. En revanche, au vu de la réduction d’échelle des technologies des semiconducteurs utilisés pour la mise en oeuvre des composants faisant partie des CRETS, les
caches sur puce sont de plus en plus prédisposés aux pannes. Par conséquent, nous avons développé des méthodologies SPTA pour l’estimation des pWCETs en présence de pannes.
Nous avons effectué également des évaluations de l’impact de ces fautes sur les comportements temporels. Afin d’examiner les pannes, nous avons modélisé dans un premier temps les pannes transitoires et permanentes. Une panne transitoire représente un changement d’état temporaire. Le système peut ainsi être restauré en utilisant des techniques de détection et de correction des pannes. D’un autre côté, une panne permanente introduit un changement permanent. Elle persiste après son apparition et affecte en conséquence le comportement général du système. Nous avons alors proposé une méthode basée sur les chaînes de Markov afin de modéliser les états de disposition de la mémoire. Pour chaque accès à un bloc de mémoire, le changement de l’état est calculé en utilisant une matrice de transition, tout en tenant compte des impacts des fautes transitoires. Nous avons également utilisé différents types de modèles de
la chaîne de Markov pour représenter le système ayant subi un nombres différent de pannes permanentes. Les expériences montrent que notre méthode SPTA assure des résultats précis
en présence des pannes transitoires et permanentes.----------ABSTRACT : A cache is typically the bridge between a processor and its main memory. It significantly reduces the access latencies to memory blocks and its timing behavior. Random caches—caches with a random replacement policy—have been proposed to improve timing behavior estimates in critical real-time embedded systems (CRTESs) by reducing pathological cases due to systematic cache misses. Measurement Based Probabilistic Timing Analysis (MBPTA)and Static Probabilistic Timing Analysis (SPTA) aim at providing safe probabilistic Worst Case Execution Time (pWCET) estimates for random caches. In this dissertation, we present research work on timing estimation based on SPTA. State-of-the-art SPTA methodologies produce safe and tight pWCET estimates. However, as semiconductor technology scales down, CRTES components—especially their on-chip caches—become prone to faults. Consequently,we developed SPTA methodologies to estimate pWCETs in the presence of faults, and evaluated the impacts of faults on timing behaviors. To investigate faults, we first defined transient and permanent fault models. A transient fault
represents a temporary change of state. The system with transient faults can be recovered using fault detection and correction techniques. A permanent fault represents a permanent change of state. It persists after its occurrence and affects the system’s behavior afterwards. Additionally, we proposed a Markov chain method to model memory layout states. For each memory block access, the state changes are calculated using a transition matrix. The transient fault impacts were integrated into the transition matrix computation, and we used different groups of Markov chain models to represent the system with different number of
permanent faults. Experiments showed that our SPTA method provided accurate results in the presence of both transient and permanent faults
Development and certification of mixed-criticality embedded systems based on probabilistic timing analysis
An increasing variety of emerging systems relentlessly replaces or augments the functionality of mechanical subsystems with embedded electronics. For quantity, complexity, and use, the safety of such subsystems is an increasingly important matter. Accordingly, those systems are subject to safety certification to demonstrate system's safety by rigorous development processes and hardware/software constraints. The massive augment in embedded processors' complexity renders the arduous certification task significantly harder to achieve. The focus of this thesis is to address the certification challenges in multicore architectures: despite their potential to integrate several applications on a single platform, their inherent complexity imperils their timing predictability and certification. Recently, the Measurement-Based Probabilistic Timing Analysis (MBPTA) technique emerged as an alternative to deal with hardware/software complexity. The innovation that MBPTA brings about is, however, a major step from current certification procedures and standards. The particular contributions of this Thesis include: (i) the definition of certification arguments for mixed-criticality integration upon multicore processors. In particular we propose a set of safety mechanisms and procedures as required to comply with functional safety standards. For timing predictability, (ii) we present a quantitative approach to assess the likelihood of execution-time exceedance events with respect to the risk reduction requirements on safety standards. To this end, we build upon the MBPTA approach and we present the design of a safety-related source of randomization (SoR), that plays a key role in the platform-level randomization needed by MBPTA. And (iii) we evaluate current certification guidance with respect to emerging high performance design trends like caches. Overall, this Thesis pushes the certification limits in the use of multicore and MBPTA technology in Critical Real-Time Embedded Systems (CRTES) and paves the way towards their adoption in industry.Una creciente variedad de sistemas emergentes reemplazan o aumentan la funcionalidad de subsistemas mecánicos con componentes electrĂłnicos embebidos. El aumento en la cantidad y complejidad de dichos subsistemas electrĂłnicos asĂ como su cometido, hacen de su seguridad una cuestiĂłn de creciente importancia. Tanto es asĂ que la comercializaciĂłn de estos sistemas crĂticos está sujeta a rigurosos procesos de certificaciĂłn donde se garantiza la seguridad del sistema mediante estrictas restricciones en el proceso de desarrollo y diseño de su hardware y software. Esta tesis trata de abordar los nuevos retos y dificultades dadas por la introducciĂłn de procesadores multi-nĂşcleo en dichos sistemas crĂticos: aunque su mayor rendimiento despierta el interĂ©s de la industria para integrar mĂşltiples aplicaciones en una sola plataforma, suponen una mayor complejidad. Su arquitectura desafĂa su análisis temporal mediante los mĂ©todos tradicionales y, asimismo, su certificaciĂłn es cada vez más compleja y costosa. Con el fin de lidiar con estas limitaciones, recientemente se ha desarrollado una novedosa tĂ©cnica de análisis temporal probabilĂstico basado en medidas (MBPTA). La innovaciĂłn de esta tĂ©cnica, sin embargo, supone un gran cambio cultural respecto a los estándares y procedimientos tradicionales de certificaciĂłn. En esta lĂnea, las contribuciones de esta tesis están agrupadas en tres ejes principales: (i) definiciĂłn de argumentos de seguridad para la certificaciĂłn de aplicaciones de criticidad-mixta sobre plataformas multi-nĂşcleo. Se definen, en particular, mecanismos de seguridad, tĂ©cnicas de diagnĂłstico y reacciĂłn de faltas acorde con el estándar IEC 61508 sobre una arquitectura multi-nĂşcleo de referencia. Respecto al análisis temporal, (ii) presentamos la cuantificaciĂłn de la probabilidad de exceder un lĂmite temporal y su relaciĂłn con los requisitos de reducciĂłn de riesgos derivados de los estándares de seguridad funcional. Con este fin, nos basamos en la tĂ©cnica MBPTA y presentamos el diseño de una fuente de nĂşmeros aleatorios segura; un componente clave para conseguir las propiedades aleatorias requeridas por MBPTA a nivel de plataforma. Por Ăşltimo, (iii) extrapolamos las guĂas actuales para la certificaciĂłn de arquitecturas multi-nĂşcleo a una soluciĂłn comercial de 8 nĂşcleos y las evaluamos con respecto a las tendencias emergentes de diseño de alto rendimiento (caches). Con estas contribuciones, esta tesis trata de abordar los retos que el uso de procesadores multi-nĂşcleo y MBPTA implican en el proceso de certificaciĂłn de sistemas crĂticos de tiempo real y facilita, de esta forma, su adopciĂłn por la industria.Postprint (published version
Exploiting Natural On-chip Redundancy for Energy Efficient Memory and Computing
Power density is currently the primary design constraint across most computing segments and the main performance limiting factor. For years, industry has kept power density constant, while increasing frequency, lowering transistors supply (Vdd) and threshold (Vth) voltages. However, Vth scaling has stopped because leakage current is exponentially related to it. Transistor count and integration density keep doubling every process generation (Moore’s Law), but the power budget caps the amount of hardware that can be active at the same time, leading to dark silicon. With each new generation, there are more resources available, but we cannot fully exploit their performance potential. In the last years, different research trends have explored how to cope with dark silicon and unlock the energy efficiency of the chips, including Near-Threshold voltage Computing (NTC) and approximate computing. NTC aggressively lowers Vdd to values near Vth. This allows a substantial reduction in power, as dynamic power scales quadratically with supply voltage. The resultant power reduction could be used to activate more chip resources and potentially achieve performance improvements. Unfortunately, Vdd scaling is limited by the tight functionality margins of on-chip SRAM transistors. When scaling Vdd down to values near-threshold, manufacture-induced parameter variations affect the functionality of SRAM cells, which eventually become not reliable. A large amount of emerging applications, on the other hand, features an intrinsic error-resilience property, tolerating a certain amount of noise. In this context, approximate computing takes advantage of this observation and exploits the gap between the level of accuracy required by the application and the level of accuracy given by the computation, providing that reducing the accuracy translates into an energy gain. However, deciding which instructions and data and which techniques are best suited for approximation still poses a major challenge. This dissertation contributes in these two directions. First, it proposes a new approach to mitigate the impact of SRAM failures due to parameter variation for effective operation at ultra-low voltages. We identify two levels of natural on-chip redundancy: cache level and content level. The first arises because of the replication of blocks in multi-level cache hierarchies. We exploit this redundancy with a cache management policy that allocates blocks to entries taking into account the nature of the cache entry and the use pattern of the block. This policy obtains performance improvements between 2% and 34%, with respect to block disabling, a technique with similar complexity, incurring no additional storage overhead. The latter (content level redundancy) arises because of the redundancy of data in real world applications. We exploit this redundancy compressing cache blocks to fit them in partially functional cache entries. At the cost of a slight overhead increase, we can obtain performance within 2% of that obtained when the cache is built with fault-free cells, even if more than 90% of the cache entries have at least a faulty cell. Then, we analyze how the intrinsic noise tolerance of emerging applications can be exploited to design an approximate Instruction Set Architecture (ISA). Exploiting the ISA redundancy, we explore a set of techniques to approximate the execution of instructions across a set of emerging applications, pointing out the potential of reducing the complexity of the ISA, and the trade-offs of the approach. In a proof-of-concept implementation, the ISA is shrunk in two dimensions: Breadth (i.e., simplifying instructions) and Depth (i.e., dropping instructions). This proof-of-concept shows that energy can be reduced on average 20.6% at around 14.9% accuracy loss
Multi-core devices for safety-critical systems: a survey
Multi-core devices are envisioned to support the development of next-generation safety-critical systems, enabling the on-chip integration of functions of different criticality. This integration provides multiple system-level potential benefits such as cost, size, power, and weight reduction. However, safety certification becomes a challenge and several fundamental safety technical requirements must be addressed, such as temporal and spatial independence, reliability, and diagnostic coverage. This survey provides a categorization and overview at different device abstraction levels (nanoscale, component, and device) of selected key research contributions that support the compliance with these fundamental safety requirements.This work has been partially supported by the Spanish Ministry of Economy and Competitiveness under grant TIN2015-65316-P, Basque Government under grant KK-2019-00035 and the HiPEAC Network of Excellence. The Spanish Ministry of Economy and Competitiveness has also partially supported Jaume Abella under Ramon y Cajal postdoctoral fellowship (RYC-2013-14717).Peer ReviewedPostprint (author's final draft
- …