25 research outputs found

    Cross-Layer Early Reliability Evaluation for the Computing cOntinuum

    Advanced multifunctional computing systems realized in forthcoming technologies hold the promise of a significant increase in computational capability that will offer end users ever-improving services and functionalities (e.g., next-generation mobile devices, cloud services, etc.). However, the same path that is leading technologies toward these remarkable achievements is also making electronic devices increasingly unreliable, posing a threat to a society that depends on ICT in every aspect of human activity. Reliability of electronic systems is therefore a key challenge for the whole of ICT and must be guaranteed without penalizing or slowing down the characteristics of the final products. The CLERECO EU FP7 (GA No. 611404) research project addresses early, accurate reliability evaluation and efficient exploitation of reliability at different design phases, since these are two of the most important and challenging tasks toward this goal.

    Multi-faceted microarchitecture level reliability characterization for NVIDIA and AMD GPUs

    State-of-the-art GPU chips are designed to deliver extreme throughput for graphics as well as for data-parallel general-purpose computing workloads (GPGPU computing). Unlike computing for graphics, GPGPU computing requires highly reliable operation. Since provisioning for high reliability may affect performance, the design of GPGPU systems requires the vulnerability of GPU workloads to soft errors to be evaluated jointly with the performance of GPU chips. We present an extended study based on a consolidated workflow for evaluating reliability in correlation with performance for four GPU architectures and corresponding chips: AMD Southern Islands and NVIDIA G80/GT200/Fermi. We obtained reliability measurements (AVF and FIT) employing both fault injection and ACE analysis based on microarchitecture-level simulators. Apart from the reliability-only and performance-only measurements, we propose combined metrics for performance and reliability that assist comparisons for the same application among GPU chips of different ISAs and vendors, as well as among benchmarks on the same GPU chip.

    Informed microarchitecture design space exploration using workload dynamics

    Program runtime characteristics exhibit significant variation. As microprocessor architectures become more complex, their efficiency depends on the capability of adapting to workload dynamics. Moreover, with the approaching billion-transistor microprocessor era, it is not always economical or feasible to design processors with thermal cooling and reliability redundancy capabilities that target an application’s worst-case scenario. Therefore, analyzing complex workload dynamics early, at the microarchitecture design stage, is crucial to forecast workload runtime behavior across architecture design alternatives and evaluate the efficiency of workload scenario-based architecture optimizations. Existing methods focus exclusively on predicting aggregated workload behavior. In this paper, we propose accurate and efficient techniques and models to reason about workload dynamics across the microarchitecture design space without using detailed cycle-level simulations. Our proposed techniques employ wavelet-based multiresolution decomposition and neural-network-based non-linear regression modeling. We extensively evaluate the efficiency of our predictive models in forecasting performance, power and reliability domain workload dynamics that the SPEC CPU 2000 benchmarks manifest on high-performance microprocessors with a microarchitecture design space that consists of 9 key parameters. Our results show that the models achieve high accuracy in revealing workload dynamic behavior across a large microarchitecture design space. We also demonstrate that the proposed techniques can be used to efficiently explore workload scenario-driven architecture optimizations.

    Balancing soft error coverage with lifetime reliability in redundantly multithreaded processors

    Silicon reliability is a key challenge facing the microprocessor industry. Processors need to be designed such that they are resilient against both soft errors and lifetime reliability phenomena. However, techniques developed to address one class of reliability problems may impact other aspects of silicon reliability. In this paper, we show that Redundant Multi-Threading (RMT), which provides soft error protection, degrades lifetime reliability. We then explore two different architectural approaches to tackle this problem, namely, Dynamic Voltage Scaling (DVS) and partial RMT. We show that each approach has certain strengths and weaknesses with respect to performance, soft error coverage, and lifetime reliability. We then propose and evaluate a hybrid approach that combines DVS and partial RMT. We show that this approach provides better improvement in lifetime reliability than DVS or partial RMT alone, buys back a significant amount of the performance that is lost due to DVS, and provides nearly complete soft error coverage.

    Translation Lookaside Buffer on the 65-nm STG DICE Hardened Elements

    This paper presents the design of a hardened translation lookaside buffer based on Spaced Transistor Groups (STG) DICE cells in 65-nm bulk CMOS technology. The resistance to impacts of single nuclear particles is achieved by spacing transistors in two groups together with transistors of the output combinational logic. The elements contain two spaced identical groups of transistors. Charge collection from a particle track by transistors of only one of the two groups does not lead to a cell upset. The proposed matching logic element, based on the STG DICE cell for a content-addressable memory, was simulated using a TCAD tool. The results show resistance to impacts of single nuclear particles with linear energy transfer (LET) values up to 70 MeV·cm²/mg. Short-term noise pulses in the combinational logic of the element can be observed in the range of LET values from 20 to 70 MeV·cm²/mg.