9 research outputs found

    Power and Reliability Management of SoCs

    Get PDF
    Today's embedded systems integrate multiple IP cores for processing, communication, and sensing on a single die as systems-on-chip (SoCs). Aggressive transistor scaling, decreased voltage margins and increased processor power and temperature have made reliability assessment a much more significant issue. Although reliability of devices and interconnect has been broadly studied, in this work, we study a tradeoff between reliability and power consumption for component-based SoC designs. We specifically focus on hard error rates as they cause a device to permanently stop operating. We also present a joint reliability and power management optimization problem whose solution is an optimal management policy. When careful joint policy optimization is performed, we obtain a significant improvement in energy consumption (40%) in tandem with meeting a reliability constraint for all SoC operating temperatures

    Fault tolerance at system level based on RADIC architecture

    Get PDF
    The increasing failure rate in High Performance Computing encourages the investigation of fault tolerance mechanisms to guarantee the execution of an application in spite of node faults. This paper presents an automatic and scalable fault tolerant model designed to be transparent for applications and for message passing libraries. The model consists of detecting failures in the communication socket caused by a faulty node. In those cases, the affected processes are recovered in a healthy node and the connections are reestablished without losing data. The Redundant Array of Distributed Independent Controllers architecture proposes a decentralized model for all the tasks required in a fault tolerance system: protection, detection, recovery and masking. Decentralized algorithms allow the application to scale, which is a key property for current HPC system. Three different rollback recovery protocols are defined and discussed with the aim of offering alternatives to reduce overhead when multicore systems are used. A prototype has been implemented to carry out an exhaustive experimental evaluation through Master/Worker and Single Program Multiple Data execution models. Multiple workloads and an increasing number of processes have been taken into account to compare the above mentioned protocols. The executions take place in two multicore Linux clusters with different socket communications libraries

    Reliability-Aware Design for Nanometer-Scale Devices, January 2008

    Get PDF
    Continuous transistor scaling due to improvements in CMOS devices and manufacturing technologies is increasing processor power densities and temperatures; thus, creating challenges to maintain manufacturing yield rates and reliable devices in their expected lifetimes for latest nanometer-scale dimensions. In fact, new system and processor microarchitectures require new reliability-aware design methods and exploration tools that can face these challenges without significantly increasing manufacturing cost, reducing system performance or imposing large area overheads due to redundancy. In this paper we overview the latest approaches in reliability modeling and variability-tolerant design for latest technology nodes, and advocate the need of reliability-aware design for forthcoming consumer electronics. Moreover, we illustrate with a case study of an embedded processor that effec- tive reliability-aware design can be achieved in nanometer-scale devices through integral design approaches that covers modeling and exploration of reliability effects, and hardware-software architectural techniques to provide reliability-enhanced solutions at both microarchitectural- and system-level

    Dynamic thermal management in 3D multicore architectures

    Get PDF
    Technology scaling has caused the feature sizes to shrink continuously, whereas interconnects, unlike transistors, have not followed the same trend. Designing 3D stack architectures is a recently proposed approach to overcome the power consumption and delay problems associated with the interconnects by reducing the length of the wires going across the chip. However, 3D integration introduces serious thermal challenges due to the high power density resulting from placing computational units on top of each other. In this work, we first investigate how the existing thermal management, power management and job scheduling policies affect the thermal behavior in 3D chips. We then propose a dynamic thermally-aware job scheduling technique for 3D systems to reduce the thermal problems at very low performance cost. Our technique can also be integrated with power management policies to reduce energy consumption while avoiding the thermal hot spots and large temperature variations

    Proactive temperature balancing for low cost thermal management in MPSoCs

    Full text link
    Abstract — Designing thermal management strategies that reduce the impact of hot spots and on-die temperature variations at low performance cost is a very significant challenge for multiprocessor system-on-chips (MPSoCs). In this work, we present a proactive MPSoC thermal man-agement approach, which predicts the future temperature and adjusts the job allocation on the MPSoC to minimize the impact of thermal hot spots and temperature variations without degrading performance. In addition, we implement and compare several reactive and proactive management strategies, and demonstrate that our proactive temperature-aware MPSoC job allocation technique is able to dramatically reduce the adverse effects of temperature at very low performance cost. We show experimental results using a simulator as well as an implementation on an UltraSPARC T1 system. I

    Dynamic thermal management in 3D multicore architectures

    Full text link

    Energy reconstruction on the LHC ATLAS TileCal upgraded front end: feasibility study for a sROD co-processing unit

    Get PDF
    Dissertation presented in ful lment of the requirements for the degree of: Master of Science in Physics 2016The Phase-II upgrade of the Large Hadron Collider at CERN in the early 2020s will enable an order of magnitude increase in the data produced, unlocking the potential for new physics discoveries. In the ATLAS detector, the upgraded Hadronic Tile Calorimeter (TileCal) Phase-II front end read out system is currently being prototyped to handle a total data throughput of 5.1 TB/s, from the current 20.4 GB/s. The FPGA based Super Read Out Driver (sROD) prototype must perform an energy reconstruction algorithm on 2.88 GB/s raw data, or 275 million events per second. Due to the very high level of pro ciency required and time consuming nature of FPGA rmware development, it may be more e ective to implement certain complex energy reconstruction and monitoring algorithms on a general purpose, CPU based sROD co-processor. Hence, the feasibility of a general purpose ARM System on Chip based co-processing unit (PU) for the sROD is determined in this work. A PCI-Express test platform was designed and constructed to link two ARM Cortex-A9 SoCs via their PCI-Express Gen-2 x1 interfaces. Test results indicate that the latency of the PCI-Express interface is su ciently low and the data throughput is superior to that of alternative interfaces such as Ethernet, for use as an interconnect for the SoCs to the sROD. CPU performance benchmarks were performed on ve ARM development platforms to determine the CPU integer, oating point and memory system performance as well as energy e ciency. To complement the benchmarks, Fast Fourier Transform and Optimal Filtering (OF) applications were also tested. Based on the test results, in order for the PU to process 275 million events per second with OF, within the 6 s timing budget of the ATLAS triggering system, a cluster of three Tegra-K1, Cortex-A15 SoCs connected to the sROD via a Gen-2 x8 PCI-Express interface would be suitable. A high level design for the PU is proposed which surpasses the requirements for the sROD co-processor and can also be used in a general purpose, high data throughput system, with 80 Gb/s Ethernet and 15 GB/s PCI-Express throughput, using four X-Gene SoCs

    Classification of Resilience Techniques Against Functional Errors at Higher Abstraction Layers of Digital Systems

    Get PDF
    Nanoscale technology nodes bring reliability concerns back to the center stage of digital system design. A systematic classification of approaches that increase system resilience in the presence of functional hardware (HW)-induced errors is presented, dealing with higher system abstractions, such as the (micro) architecture, the mapping, and platform software (SW). The field is surveyed in a systematic way based on nonoverlapping categories, which add insight into the ongoing work by exposing similarities and differences. HW and SW solutions are discussed in a similar fashion so that interrelationships become apparent. The presented categories are illustrated by representative literature examples to illustrate their properties. Moreover, it is demonstrated how hybrid schemes can be decomposed into their primitive components

    Estimation à haut-niveau des dégradations temporelles dans les processeurs (méthodologie et mise en oeuvre logicielle)

    Get PDF
    Actuellement, les circuits numériques nécessitent d'être de plus en plus performants. Aussi, les produits doivent être conçus le plus rapidement possible afin de gagner les précieuses parts de marché. Les méthodes rapides de conception et l'utilisation de MPSoC ont permis de satisfaire à ces exigences, mais sans tenir compte précisément de l'impact du vieillissement des circuits sur la conception. Or les MPSoC utilisent les technologies de fabrication les plus récentes et sont de plus en plus soumis aux défaillances matérielles. De nos jours, les principaux mécanismes de défaillance observés dans les transistors des MPSoC sont le HCI et le NBTI. Des marges sont alors ajoutées pour que le circuit soit fonctionnel pendant son utilisation, en considérant le cas le plus défavorable pour chaque mécanisme. Ces marges deviennent de plus en plus importantes et diminuent les performances attendues. C'est pourquoi les futures méthodes de conception nécessitent de tenir compte des dégradations matérielles en fonction de l utilisation du circuit. Dans cette thèse, nous proposons une méthode originale pour simuler le vieillissement des MPSoC à haut niveau d'abstraction. Cette méthode s'applique lors de la conception du système c.-à-d. entre l'étape de définition des spécifications et la mise en production. Un modèle empirique permet d'estimer les dégradations temporelles en fin de vie d'un circuit. Un exemple d'application est donné pour un processeur embarqué et les résultats pour un ensemble d'applications sont reportés. La solution proposée permet d'explorer différentes configurations d'une architecture MPSoC pour comparer le vieillissement. Aussi, l'application la plus sévère pour le vieillissement peut être identifiée.Nowadays, more and more performance is expected from digital circuits. What s more, the market requires fast conception methods, in order to propose the newest technology available. Fast conception methods and the utilization of MPSoC have enabled high performance and short time-to-market while taking little attention to aging. However, MPSoC are more and more prone to hardware failures that occur in transistors. Today, the prevailing failure mechanisms in MPSoC are HCI and NBTI. Margins are usually added on new products to avoid failures during execution, by considering worst case scenario for each mechanism. For the newest technology, margins are becoming more and more important and products performance is getting lower and lower. That s why the conception needs to take into account hardware failures according to the execution of software. This thesis propose a new methodology to simulate aging at high level of abstraction, which can be applied to MPSoC. The method can be applied during product conception, between the specification phase and the production. An empirical model is used to estimate slack time at circuit's end of life. A use case is conducted on an embedded processor and degradation results are reported for a set of applications. The solution enables architecture exploration and MPSoC aging can thus be compared. The software with most severe impact on aging can also be determined.BORDEAUX1-Bib.electronique (335229901) / SudocSudocFranceF
    corecore