102 research outputs found

    Understanding Soft Errors in Uncore Components

    Full text link
    The effects of soft errors in processor cores have been widely studied. However, little has been published about soft errors in uncore components, such as memory subsystem and I/O controllers, of a System-on-a-Chip (SoC). In this work, we study how soft errors in uncore components affect system-level behaviors. We have created a new mixed-mode simulation platform that combines simulators at two different levels of abstraction, and achieves 20,000x speedup over RTL-only simulation. Using this platform, we present the first study of the system-level impact of soft errors inside various uncore components of a large-scale, multi-core SoC using the industrial-grade, open-source OpenSPARC T2 SoC design. Our results show that soft errors in uncore components can significantly impact system-level reliability. We also demonstrate that uncore soft errors can create major challenges for traditional system-level checkpoint recovery techniques. To overcome such recovery challenges, we present a new replay recovery technique for uncore components belonging to the memory subsystem. For the L2 cache controller and the DRAM controller components of OpenSPARC T2, our new technique reduces the probability that an application run fails to produce correct results due to soft errors by more than 100x with 3.32% and 6.09% chip-level area and power impact, respectively.Comment: to be published in Proceedings of the 52nd Annual Design Automation Conferenc

    E-QED: Electrical Bug Localization During Post-Silicon Validation Enabled by Quick Error Detection and Formal Methods

    Full text link
    During post-silicon validation, manufactured integrated circuits are extensively tested in actual system environments to detect design bugs. Bug localization involves identification of a bug trace (a sequence of inputs that activates and detects the bug) and a hardware design block where the bug is located. Existing bug localization practices during post-silicon validation are mostly manual and ad hoc, and, hence, extremely expensive and time consuming. This is particularly true for subtle electrical bugs caused by unexpected interactions between a design and its electrical state. We present E-QED, a new approach that automatically localizes electrical bugs during post-silicon validation. Our results on the OpenSPARC T2, an open-source 500-million-transistor multicore chip design, demonstrate the effectiveness and practicality of E-QED: starting with a failed post-silicon test, in a few hours (9 hours on average) we can automatically narrow the location of the bug to (the fan-in logic cone of) a handful of candidate flip-flops (18 flip-flops on average for a design with ~ 1 Million flip-flops) and also obtain the corresponding bug trace. The area impact of E-QED is ~2.5%. In contrast, deter-mining this same information might take weeks (or even months) of mostly manual work using traditional approaches

    Systematic energy characterization of CMP/SMT processor systems via automated micro-benchmarks

    Get PDF
    Microprocessor-based systems today are composed of multi-core, multi-threaded processors with complex cache hierarchies and gigabytes of main memory. Accurate characterization of such a system, through predictive pre-silicon modeling and/or diagnostic postsilicon measurement based analysis are increasingly cumbersome and error prone. This is especially true of energy-related characterization studies. In this paper, we take the position that automated micro-benchmarks generated with particular objectives in mind hold the key to obtaining accurate energy-related characterization. As such, we first present a flexible micro-benchmark generation framework (MicroProbe) that is used to probe complex multi-core/multi-threaded systems with a variety and range of energy-related queries in mind. We then present experimental results centered around an IBM POWER7 CMP/SMT system to demonstrate how the systematically generated micro-benchmarks can be used to answer three specific queries: (a) How to project application-specific (and if needed, phase-specific) power consumption with component-wise breakdowns? (b) How to measure energy-per-instruction (EPI) values for the target machine? (c) How to bound the worst-case (maximum) power consumption in order to determine safe, but practical (i.e. affordable) packaging or cooling solutions? The solution approaches to the above problems are all new. Hardware measurement based analysis shows superior power projection accuracy (with error margins of less than 2.3% across SPEC CPU2006) as well as max-power stressing capability (with 10.7% increase in processor power over the very worst-case power seen during the execution of SPEC CPU2006 applications).Peer ReviewedPostprint (author’s final draft

    Energy Efficient Servers

    Get PDF
    Computer scienc

    Automatic Loop Kernel Analysis and Performance Modeling With Kerncraft

    Full text link
    Analytic performance models are essential for understanding the performance characteristics of loop kernels, which consume a major part of CPU cycles in computational science. Starting from a validated performance model one can infer the relevant hardware bottlenecks and promising optimization opportunities. Unfortunately, analytic performance modeling is often tedious even for experienced developers since it requires in-depth knowledge about the hardware and how it interacts with the software. We present the "Kerncraft" tool, which eases the construction of analytic performance models for streaming kernels and stencil loop nests. Starting from the loop source code, the problem size, and a description of the underlying hardware, Kerncraft can ideally predict the single-core performance and scaling behavior of loops on multicore processors using the Roofline or the Execution-Cache-Memory (ECM) model. We describe the operating principles of Kerncraft with its capabilities and limitations, and we show how it may be used to quickly gain insights by accelerated analytic modeling.Comment: 11 pages, 4 figures, 8 listing

    Energy Efficient Servers

    Get PDF
    Computer scienc
    • …
    corecore