
    Timing constraints for high speed counterflow-clocked pipelining

    With the escalation of clock frequencies and the increasing ratio of wire delays to gate delays, clock skew is a major problem to be overcome in tomorrow's high-speed VLSI chips. Also, with an increasing number of stages switching simultaneously comes the problem of higher peak power consumption. In our past work, we proposed a novel scheme called Counterflow-Clocked (C2) Pipelining to combat these problems, and discussed methods for composing C2 pipelined stages. In this paper, we analyze in detail the timing constraints that must be obeyed in designing basic C2 pipelined stages as well as in composing them. C2 pipelining is well suited for systems that exhibit mostly unidirectional data flows and possess mostly nearest-neighbor connections. We illustrate C2 pipelining on such designs with several design examples. C2 pipelining eases the distribution of high-speed clocks, shortens the clock period by eliminating global clock signals, allows natural use of level-sensitive dynamic latches, and generates less internal switching noise due to the uniformly distributed latch operation. By applying C2 pipelining and its composition methods to build a system, VLSI designers can replace the global clock skew problem with many local one-sided delay constraints.
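
    The payoff the abstract describes, trading one global skew bound for many local one-sided constraints, can be sketched numerically. Below is a minimal Python check of one such constraint; the exact form of the inequality and the picosecond figures are illustrative assumptions of ours, not the paper's derived constraints.

    # Illustrative one-sided (minimum-delay) constraint between two
    # neighboring C2 stages: hypothetical form and numbers, not the
    # paper's exact derivation. Only a lower bound on logic delay is
    # needed, so no global two-sided skew budget has to be met.
    def c2_stage_ok(logic_delay_min, clock_separation, latch_hold):
        """Data through the stage must arrive no earlier than the local
        clock separation to the neighboring latch plus its hold time."""
        return logic_delay_min >= clock_separation + latch_hold

    # Example: 120 ps minimum logic delay vs. 60 ps local clock
    # separation and a 20 ps hold requirement -> met, 40 ps of slack.
    print(c2_stage_ok(120e-12, 60e-12, 20e-12))  # True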

    Elastic circuits

    Elasticity in circuits and systems provides tolerance to variations in computation and communication delays. This paper presents a comprehensive overview of elastic circuits for those designers who are mainly familiar with synchronous design. Elasticity can be implemented both synchronously and asynchronously, although it was traditionally more often associated with asynchronous circuits. This paper shows that synchronous and asynchronous elastic circuits can be designed, analyzed, and optimized using similar techniques. Thus, choices between synchronous and asynchronous implementations are localized and deferred until late in the design process.
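
    To make the notion of elasticity concrete, here is a toy cycle-level Python model of a single elastic channel with valid/stop back-pressure, in the spirit of the synchronous elastic (SELF-style) protocols the paper surveys; the class name, capacity, and simplified semantics are ours.

    from collections import deque

    class ElasticBuffer:
        """Toy elastic channel: tolerates delay variation on both sides."""
        def __init__(self, capacity=2):
            self.fifo = deque()
            self.capacity = capacity

        def stop(self):               # back-pressure toward the producer
            return len(self.fifo) >= self.capacity

        def push(self, valid, data):  # producer side: accept unless stopped
            if valid and not self.stop():
                self.fifo.append(data)

        def pop(self, ready):         # consumer side: emit when data exists
            if ready and self.fifo:
                return self.fifo.popleft()
            return None               # a bubble; the consumer tolerates it

    buf = ElasticBuffer()
    for token in "abc":               # third push is stalled by back-pressure
        buf.push(True, token)
    print(buf.pop(True), buf.pop(True), buf.pop(True))  # a b None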

    Architecture Independent Timing Speculation Techniques in VLSI Circuits.

    Conventional digital circuits must ensure correct operation throughout a wide range of operating conditions, including process, voltage, and temperature variation. These conditions affect circuit delays, so safety margins must be put in place, which come at a power and performance cost. The Razor system proposed eliminating these timing margins by running a circuit with occasional timing errors and correcting the errors when they occur. Several Razor-style designs have been proposed; however, prior to this work, Razor could not be applied blindly or automatically to designs, as the various error correction schemes modified the architecture of the target design. Because of the architectural invasiveness and design complexity of these techniques, no published Razor-style system had been applied to a complete existing commercial processor. Additionally, in all prior Razor-style systems, there is a fundamental tradeoff between the speculation window and short-path (minimum delay) constraints, limiting the technique's effectiveness. This thesis introduces the concept of Razor using two-phase latch-based timing. By identifying and utilizing time borrowing as an error correction mechanism, it allows Razor to be applied without the need to reload data or replay instructions. This allows Razor to be blindly and automatically applied to existing designs without detailed knowledge of their internal architecture. Additionally, latch-based Razor allows for large speculation windows, up to 100% of nominal circuit delay, because it breaks the connection between minimum delay constraints and the speculation window. By demonstrating how to transform conventional flip-flop-based designs, including those that make use of clock gating, to two-phase latch-based timing, Razor can be automatically added to a large set of existing digital designs. Two forms of latch-based Razor are proposed. First, Bubble Razor ripples stall cycles throughout a circuit in response to timing errors; it is applied to the ARM Cortex-M3 processor, the first ever application of a Razor technique to a complete, existing processor design. Additional work applies Bubble Razor to the ARM Cortex-R4 processor. The second latch-based Razor technique, Voltage Razor, uses voltage boosting to correct for timing errors.
    PhD dissertation, Electrical Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/102461/1/mfojtik_1.pd
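
    The key Bubble Razor mechanism, rippling stall cycles outward from the stage that caught a timing error, can be sketched in a few lines of Python. The ring topology and spreading rule below are our simplification for illustration; the thesis defines the algorithm over arbitrary latch graphs.

    def ripple_bubbles(num_stages, error_stage):
        """Spread a stall 'bubble' one hop per cycle until every stage
        has stalled exactly once -- no data reload or instruction replay."""
        stalled_now = {error_stage}
        already = set(stalled_now)
        waves = []
        while stalled_now:
            waves.append(sorted(stalled_now))
            frontier = set()
            for s in stalled_now:                  # notify both neighbors
                for n in ((s - 1) % num_stages, (s + 1) % num_stages):
                    if n not in already:
                        frontier.add(n)
            already |= frontier
            stalled_now = frontier
        return waves

    print(ripple_bubbles(6, error_stage=2))
    # [[2], [1, 3], [0, 4], [5]] -- one lost cycle per stage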

    Increasing rendering performance of graphics hardware

    Graphics processing unit (GPU) performance is increasing faster than central processing unit (CPU) performance. This growth is driven by performance improvements that can be divided into three categories: algorithmic improvements, architectural improvements, and circuit-level improvements. In this dissertation I present techniques that improve the rendering performance of graphics hardware, measured in speed, power consumption, or image quality, in each of these three areas. At the algorithmic level, I introduce a method for using graphics hardware to rapidly and efficiently generate summed-area tables, which are data structures that hold pre-computed two-dimensional integrals of subsets of a given image, and present several novel rendering techniques that take advantage of summed-area tables to produce dynamic, high-quality images at interactive frame rates. These techniques improve the visual quality of images rendered on current commodity GPUs without requiring modifications to the underlying hardware or architecture. At the architectural level, I propose modifications to the architecture of current GPUs that add conditional streaming capabilities. I describe a novel GPU-based ray-tracing algorithm that takes advantage of conditional output streams to reduce memory bandwidth requirements by over an order of magnitude compared to previous techniques. At the circuit level, I propose a compute-on-demand paradigm for the design of high-speed and energy-efficient graphics components. The goal of the compute-on-demand paradigm is to perform computation at the bit level only when needed. The paradigm exploits the data-dependent nature of computation, and thereby obtains speed and energy improvements by optimizing designs for the common case. This approach is illustrated with the design of a high-speed Z-comparator implemented using asynchronous logic. Asynchronous or "clockless" circuits were chosen for my implementations since they allow for data-dependent completion times and reduce power consumption by disabling inactive components. The resulting circuit-level implementation runs over 1.5 times faster while dissipating 25% of the energy of a comparable synchronous comparator in the average case. Also at the circuit level, I introduce a novel implementation of counterflow pipelining, which allows two streams of data to flow in opposite directions within the same pipeline without the need for complex arbitration. The advantages of this implementation are demonstrated by the design of a high-speed asynchronous Booth multiplier. While both the comparator and the multiplier are useful components of a graphics pipeline, the objective of this work was to propose the new design paradigm as a promising alternative to current graphics hardware design practices.
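
    Summed-area tables themselves are simple to state precisely: entry (x, y) holds the sum of all pixels at or above-left of (x, y), so any axis-aligned box sum costs only four lookups. The Python/NumPy reference below shows the construction and query on the CPU; the dissertation's contribution is generating these tables rapidly on the GPU, which this sketch does not attempt.

    import numpy as np

    def summed_area_table(img):
        # prefix sums down the rows, then across the columns
        return img.cumsum(axis=0).cumsum(axis=1)

    def box_sum(sat, x0, y0, x1, y1):
        """Sum over the inclusive box [x0..x1] x [y0..y1] in O(1)."""
        total = sat[y1, x1]
        if x0 > 0:
            total -= sat[y1, x0 - 1]
        if y0 > 0:
            total -= sat[y0 - 1, x1]
        if x0 > 0 and y0 > 0:
            total += sat[y0 - 1, x0 - 1]
        return total

    img = np.arange(16, dtype=np.float64).reshape(4, 4)
    sat = summed_area_table(img)
    print(box_sum(sat, 1, 1, 2, 2))   # 30.0  (= 5 + 6 + 9 + 10)
    print(img[1:3, 1:3].sum())        # 30.0, brute-force sanity check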

    Timing speculation and adaptive reliable overclocking techniques for aggressive computer systems

    Computers have changed our lives beyond our own imagination in the past several decades. The continued and progressive advancements in VLSI technology and numerous micro-architectural innovations have played a key role in the design of spectacular low-cost, high-performance computing systems that have become omnipresent in today's technology-driven world. Performance and dependability have become key concerns as these ubiquitous computing machines continue to drive our everyday life. Every application has unique demands, as applications run in diverse operating environments. Dependable, aggressive, and adaptive systems improve efficiency in terms of speed, reliability, and energy consumption. Traditional computing systems run at a fixed clock frequency, which is determined by taking into account the worst-case timing paths, operating conditions, and process variations. Timing-speculation-based reliable overclocking advocates going beyond worst-case limits to achieve the best performance while not avoiding, but detecting and correcting, a modest number of timing errors. The success of this design methodology relies on the fact that timing-critical paths are rarely exercised in a design, and typical execution happens much faster than the timing requirements dictated by worst-case design methodology. Better-than-worst-case design methodology is advocated by several recent research pursuits, which exploit dependability techniques to enhance computer system performance. In this dissertation, we address different aspects of timing-speculation-based adaptive reliable overclocking schemes, and evaluate their role in the design of low-cost, high-performance, energy-efficient, and dependable systems. We identify various control knobs in the design that can be favorably adjusted to meet different design targets. As part of this research, we extend the SPRIT3E (Superscalar PeRformance Improvement Through Tolerating Timing Errors) framework, and characterize the extent of application-dependent performance acceleration achievable in superscalar processors by scrutinizing the various parameters that impact operation beyond worst-case limits. We study the limitations imposed by short-path constraints on our technique, and present ways to exploit them to maximize performance gains. We analyze the sensitivity of our technique's adaptiveness by exploring the necessary hardware requirements for dynamic overclocking schemes. Experimental analysis based on SPEC2000 benchmarks running on a SimpleScalar Alpha processor simulator, augmented with error rate data obtained from hardware simulations of a superscalar processor, is presented. Even though reliable overclocking guarantees functional correctness, it leads to higher power consumption. As a consequence, reliable overclocking without considering on-chip temperatures will reduce the lifetime reliability of the chip. In this thesis, we analyze how reliable overclocking impacts the on-chip temperature of a microprocessor and evaluate the effects of overheating, due to such reliable dynamic frequency tuning mechanisms, on the lifetime reliability of these systems. We then evaluate the effect of thermal throttling, a technique that clamps the on-chip temperature below a predefined value, on system performance and reliability. Our study shows that a reliably overclocked system with dynamic thermal management achieves a 25% performance improvement while lasting for 14 years when operated within 353 K.
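
    The core tradeoff of reliable overclocking, clocking the circuit past its worst-case limit and paying a recovery penalty only when a timing error actually fires, can be captured in a back-of-the-envelope Python model. The error-rate curve and five-cycle penalty below are illustrative assumptions of ours, not data from the dissertation.

    def effective_throughput(freq_ghz, nominal_ghz=1.0, penalty_cycles=5):
        # toy error model: error-free up to the worst-case (nominal)
        # frequency, then errors grow quadratically with overdrive
        overdrive = max(0.0, freq_ghz / nominal_ghz - 1.0)
        error_rate = min(1.0, overdrive ** 2)      # errors per instruction
        cycles_per_instr = 1.0 + error_rate * penalty_cycles
        return freq_ghz / cycles_per_instr         # useful work per ns

    for f in (1.0, 1.1, 1.2, 1.3, 1.4):
        print(f"{f:.1f} GHz -> {effective_throughput(f):.3f} GIPS")
    # throughput peaks near 1.1 GHz, then recovery overhead dominates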
    Over the past five decades, technology scaling, as predicted by Moore's law, has been the bedrock of semiconductor technology evolution. The continued downscaling of CMOS technology to deep sub-micron gate lengths has been the primary reason for its dominance in today's omnipresent silicon microchips. Even as the transition to the next technology node is indispensable, the initial cost and time associated with doing so present a non-level playing field for competitors in the semiconductor business. As part of this thesis, we evaluate the capability of speculative reliable overclocking mechanisms to maximize performance at a given technology level. We evaluate its competitiveness when compared to technology scaling, in terms of performance, power consumption, energy, and energy-delay product. We present a comprehensive comparison for integer and floating-point SPEC2000 benchmarks running on a simulated Alpha processor at three different technology nodes in normal and enhanced modes. Our results suggest that adopting reliable overclocking strategies can help skip a technology node altogether, or remain competitive in the market while porting to the next technology node. Reliability has become a serious concern as systems embrace nanometer technologies. In this dissertation, we propose a novel fault-tolerant aggressive system that combines soft error protection and timing error tolerance. We replicate both the pipeline registers and the pipeline stage combinational logic. The replicated logic receives its inputs from the primary pipeline registers while writing its output to the replicated pipeline registers. The organization of redundancy in the proposed Conjoined Pipeline system supports overclocking, provides concurrent error detection and recovery capability for soft errors, intermittent faults, and timing errors, and flags permanent silicon defects. The fast recovery process requires no checkpointing and takes three cycles. Back-annotated post-layout gate-level timing simulations, using 45 nm technology, of a conjoined two-stage arithmetic pipeline and a conjoined five-stage DLX pipeline processor with forwarding logic show that our approach, even under a severe fault injection campaign, achieves near 100% fault coverage and an average performance improvement of about 20% when dynamically overclocked.
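
    The Conjoined Pipeline's detection scheme, duplicating each stage's logic and registers and comparing the two results, reduces to a short sketch. The stage function, fault injection, and return convention below are ours; only the compare-and-recover idea and the three-cycle recovery figure come from the abstract.

    def conjoined_stage(logic, x, inject_fault=False):
        """Run primary and shadow copies of a stage; a mismatch flags an
        error and triggers a short, checkpoint-free recovery."""
        primary = logic(x)
        shadow = logic(x) ^ (1 if inject_fault else 0)  # simulated upset
        if primary != shadow:
            # recovery re-executes from the still-correct primary
            # registers; the paper's design needs three cycles
            return logic(x), 3       # (corrected result, recovery cycles)
        return primary, 0

    print(conjoined_stage(lambda v: v + 1, 41))                     # (42, 0)
    print(conjoined_stage(lambda v: v + 1, 41, inject_fault=True))  # (42, 3)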

    Doctor of Philosophy

    Elasticity is a design paradigm in which circuits can tolerate arbitrary latency/delay variations in their computation units as well as their communication channels. Creating elastic (both synchronous and asynchronous) designs from clocked designs has the potential benefits of increased modularity and robustness to variations. Several transformations have been suggested in the literature, and each of these requires a handshake control network (examples include synchronous elasticization and desynchronization). Elastic control network area and power overheads may become prohibitive. This dissertation investigates different optimization avenues to reduce these overheads without sacrificing control network performance. First, an algorithm and a tool, CNG, are introduced that generate a control network with the minimal total number of join and fork control steering units. Synchronous Elastic FLow (SELF) is a handshake protocol used over synchronous elastic designs. Compared to its standard eager implementation (which uses eager forks, or EForks), lazy SELF can consume less power and area. However, it typically suffers from combinational cycles and can have inferior performance in some systems. Hence, lazy SELF has rarely been studied in the literature. This work formally and exhaustively investigates the specifications, different implementations, and verification of the lazy SELF protocol. Furthermore, several new and existing lazy designs are mapped to hybrid eager/lazy implementations that retain the performance advantage of the eager design but have the power and area advantages of lazy implementations, and are free of combinational cycles. This work also introduces a novel ultra simple fork (USFork) design. The USFork has two advantages over lazy forks: it is composed of simpler logic (just wires) and does not form combinational cycles. The conditions under which an EFork can be replaced by a USFork without any performance loss are formally derived. The last optimization avenue discussed in this dissertation is Elastic Buffer Controller (EBC) merging. In a typical synchronous elastic control network, some EBCs may activate their corresponding latches on similar schedules. This work provides a framework for finding and merging such controllers in any control network, including open networks (i.e., when the environment abstract is not available or is required to be flexible) as well as networks incorporating variable-latency units. Replacing EForks with USForks under some equivalence conditions, as well as EBC merging, has been fully automated in a tool, HGEN. The impact of this work will help achieve elasticity at a reduced cost. It will broaden the class of circuits that can be elasticized with acceptable overhead (circuits that designers would otherwise find too expensive to elasticize). In a MiniMIPS processor case study, compared to a basic control network implementation, the optimization techniques of this dissertation cumulatively achieve reductions in control network area, dynamic power, and leakage power of 73.2%, 68.6%, and 69.1%, respectively.
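
    The eager-versus-lazy fork distinction at the heart of these SELF optimizations can be illustrated with a toy Python model; the semantics below (one token, one stop bit per branch) are our simplification, not the dissertation's formal definitions.

    class EFork:
        """Eager fork: remembers which branches already took the token,
        so each branch proceeds as soon as it individually can."""
        def __init__(self, n):
            self.pending = [False] * n   # branch served this token yet?

        def step(self, valid, stops):
            sent = [valid and not self.pending[i] and not stops[i]
                    for i in range(len(stops))]
            for i, s in enumerate(sent):
                self.pending[i] = self.pending[i] or s
            done = all(self.pending)
            if done:                     # token fully forked; reset
                self.pending = [False] * len(stops)
            return sent, done

    def us_fork(valid, stops):
        """Ultra simple fork: just wires -- broadcast the token, which
        moves only in a cycle when no branch asserts stop."""
        go = valid and not any(stops)
        return [go] * len(stops), go

    f = EFork(2)
    print(f.step(True, [False, True]))   # ([True, False], False)
    print(f.step(True, [False, False]))  # ([False, True], True)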