69 research outputs found

    High Performance and Low Power On-Die Interconnect Fabrics.

    Increasing power density with technology scaling has caused the operating frequency of modern microprocessors to stagnate. This has led designers to prefer multicore architectures over complex monolithic processors to keep up with the demand for rising computing throughput. Although processing units are getting smaller and simpler, the dramatic rise in their count on a single die has made the fabric that connects them increasingly complex. These interconnect fabrics have become a bottleneck in improving overall system efficiency. As a result, the design paradigm for multi-core chips is gradually shifting from a core-centric architecture towards an interconnect-centric architecture, where system efficiency is limited by the fabric rather than by the processing ability of any individual core. This dissertation introduces three novel and synergistic circuit techniques that improve the scalability of switch fabrics and make on-die integration of hundreds to thousands of cores feasible. 1) A matrix topology is proposed for designing a fully connected switch fabric that re-uses output buses for programming and stores shuffle configurations at cross points. This significantly reduces routing congestion, lowers area and power, and improves performance. Silicon measurements demonstrate 47% energy savings over a conventional implementation in a 64-lane SIMD processor fabricated in 65nm CMOS. 2) A novel approach to handling high-radix arbitration along with data routing is proposed. It optimally uses existing crossbar interconnect resources without requiring any additional overhead. Bandwidth exceeding 2Tb/s is recorded in a test prototype fabricated in 65nm. 3) Building on the latter, a new circuit topology is proposed to manage and update priority adaptively within the switch fabric without incurring additional delay or area.
Several assist circuit techniques, such as a thyristor-based sense amplifier and self-regenerating bi-directional repeaters, are proposed for high-speed, energy-efficient signaling to and from the switch fabric to improve overall routing efficiency. Using these techniques, a 64 x 64 switch fabric with a 128b data bus fabricated in 45nm achieves a throughput of 4.5Tb/s at single-cycle latency while operating at 559MHz.
Ph.D., Electrical Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/91506/1/sudhirks_1.pd
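As a rough sanity check on the 45nm figures quoted in this abstract, the peak aggregate throughput of a crossbar can be estimated as ports × bus width × clock frequency. The sketch below assumes that all 64 output ports deliver one 128-bit word every cycle; that assumption and the function name are illustrative and do not come from the dissertation.

```python
# Back-of-the-envelope estimate of peak crossbar throughput.
# Assumption (not from the source): every output port moves one full
# bus-width word per clock cycle.
PORTS = 64             # 64 x 64 switch fabric
BUS_WIDTH_BITS = 128   # 128b data bus
CLOCK_HZ = 559e6       # 559MHz operating frequency


def peak_throughput_bps(ports, bus_width_bits, clock_hz):
    """Return the peak aggregate throughput in bits per second."""
    return ports * bus_width_bits * clock_hz


tbps = peak_throughput_bps(PORTS, BUS_WIDTH_BITS, CLOCK_HZ) / 1e12
print(f"{tbps:.2f} Tb/s")
```

Under this assumption the estimate comes out slightly above 4.5 Tb/s, which is in line with the throughput reported in the abstract.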

    Baseband-processor for a passive UHF RFID transponder

    This paper describes the design of a digital processor targeting the Class-1 Generation-2 EPC Protocol for UHF RFID transponders, and proposes different techniques for reducing its power consumption. The processor has been implemented in a 0.35μm CMOS technology process using automatic tools for both logic synthesis and layout. Post-layout simulations confirm the full functionality of the prototype and predict a worst-case consumption of only 2.9μA at a 1.2V supply.
    Funding: Ministerio de Educación y Ciencia TEC2006-03022, TEC2009-08447; Junta de Andalucía TIC-0281

    Energy-aware synthesis for networks on chip architectures

    The Network on Chip (NoC) paradigm was introduced as a scalable communication infrastructure for future System-on-Chip applications. Designing application-specific, customized communication architectures is critical for obtaining low-power, high-performance solutions. Two significant design automation problems are the creation of an optimized configuration, given the application requirements, and the implementation of this on-chip network. Automating the design of on-chip networks requires models for estimating area and energy, algorithms to effectively explore the design space, and network component libraries and tools to generate the hardware description. Chip architects are faced with managing a wide range of customization options for individual components, routers and topology. As energy is of paramount importance, the effectiveness of any custom NoC generation approach lies in the availability of good energy models to effectively explore the design space. This thesis describes a complete NoC synthesis flow, called NoCGEN, for creating energy-efficient custom NoC architectures. Three major automation problems are addressed: custom topology generation, energy modeling and generation. An iterative algorithm is proposed to generate application-specific point-to-point and packet-switched networks. The algorithm explores the design space for efficient topologies using characterized models and a system-level floorplanner for evaluating placement and wire energy. Prior to our contribution, building an energy model required careful analysis of transistor or gate implementations. To alleviate the burden, an automated linear regression-based methodology is proposed to rapidly extract energy models for many router designs. The resulting models are cycle-accurate and low in complexity, are found to be within 10% of gate-level energy simulations, and execute several orders of magnitude faster than gate-level simulations.
A hardware description of the custom topology is generated using a parameterizable library and custom HDL generator. Fully reusable and scalable network components (switches, crossbars, arbiters, routing algorithms) are described using a template approach and are used to compose arbitrary topologies. A methodology for building and composing routers and topologies using a template engine is described. The entire flow is implemented as several demonstrable, extensible tools with powerful visualization functionality. Several experiments are performed to demonstrate the design space exploration capabilities and to compare the flow against a competing min-cut topology generation algorithm.
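The regression-based energy modeling mentioned in this abstract can be illustrated with a minimal sketch. It assumes a linear model of the form energy = c0 + c1 × (buffer writes) + c2 × (crossbar traversals) per cycle, fitted by ordinary least squares. The event names, the sample data and the function names are hypothetical and are not taken from NoCGEN; in a real flow the reference energies would come from gate-level simulation.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column up.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_energy_model(event_counts, energies):
    """Least-squares fit of energy = c0 + sum(c_i * count_i)."""
    rows = [[1.0] + list(counts) for counts in event_counts]
    k = len(rows[0])
    # Normal equations: (X^T X) c = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * e for r, e in zip(rows, energies)) for i in range(k)]
    return solve(xtx, xty)


def predict(coeffs, counts):
    """Evaluate the fitted model for one cycle's event counts."""
    return coeffs[0] + sum(c * n for c, n in zip(coeffs[1:], counts))


# Hypothetical training data: (buffer writes, crossbar traversals) per cycle.
samples = [(0, 0), (1, 0), (0, 1), (2, 1), (3, 2), (1, 3)]
# Synthetic "reference" energies generated from known coefficients,
# standing in for gate-level simulation results.
true_coeffs = [2.0, 0.5, 1.5]
reference = [predict(true_coeffs, s) for s in samples]

coeffs = fit_energy_model(samples, reference)
print([round(c, 6) for c in coeffs])
```

Because the synthetic data is noise-free, the fit recovers the generating coefficients. With real simulation data the residual error of such a fit would be the quantity to compare against the 10% accuracy figure quoted in the abstract.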

    Design methodology and productivity improvement in high speed VLSI circuits

    2017 Spring. Includes bibliographical references. To view the abstract, please see the full text of the document.

    Low power VLSI implementation schemes for DCT-based image compression


    Design of platform for exploring application-specific NoC architecture.

    Liu, Zhouyi. Thesis (M.Phil.)--Chinese University of Hong Kong, 2011. Includes bibliographical references (leaves 110-114). Abstracts in English and Chinese.
    Contents:
    ABSTRACTS
    ABSTRACT (IN CHINESE)
    CONTENTS
    LIST OF FIGURE
    LIST OF TABLE
    ACKNOWLEDGEMENT
    CHAPTER 1 INTRODUCTION
        1.1 NETWORK-ON-CHIP
        1.2 RELATED WORKS
        1.3 PLATFORM OVERVIEW
        1.4 AUTHOR'S CONTRIBUTION
    CHAPTER 2 NOC LIBRARY
        2.1 NETWORK TERMINOLOGY
        2.2 BASIC STRUCTURE
        2.3 LOW-POWER ORIENTED ARCHITECTURE
            2.3.1 Low-Cost Allocator Design
            2.3.2 Clock Gating
            2.3.3 Express Virtual Channel Insertion
        2.4 LOW-LATENCY ORIENTED ARCHITECTURE
            2.4.1 Lookahead Bypass Scheme
            2.4.2 Lookahead Bypass Router Architecture
    CHAPTER 3 BENCHMARK AND MEASUREMENT
        3.1 BENCHMARK GENERATION
            3.1.1 Types of Traffic Patterns
            3.1.2 Traffic Generator
        3.2 MEASUREMENT SETTING
            3.2.1 Warming-up Period
            3.2.2 Latency Definition
            3.2.3 Throughput Definition
            3.2.4 Virtual Channel Utilization
    CHAPTER 4 PLATFORM STRUCTURE
        4.1 FILE TREE
            4.1.1 System Files
            4.1.2 Low-Power NoC Related
            4.1.3 Low-Latency NoC Related
            4.1.4 Project Related
        4.2 PROCESSES
        4.3 GUI ACCESS
            4.3.1 Section 1: Project Setup
            4.3.2 Section 2-a: Low-Power Router Structure
            4.3.3 Section 2-b: Low-Latency Router Structure
            4.3.4 Section 3: Benchmark & Measurement
            4.3.5 Section 4: View Result
            4.3.6 Low-Power NoC Example
    CHAPTER 5 OPTIMIZATION AND COMPARISON
        5.1 OPTIMIZATION TECHNIQUE
            5.1.1 Optimization Phase 1: Inactive Buffer Removal
            5.1.2 Optimization Phase 2: Infighting Analysis
            5.1.3 Over-Optimization
            5.1.4 Optimization Example
        5.2 NOCS COMPARISON
        5.3 LOW-POWER IMPLEMENTATION CODE EXPORT
    CHAPTER 6 SUMMARY AND FUTURE WORK
        6.1 SUMMARY
        6.2 FUTURE WORK
    REFERENCES

    Timing speculation and adaptive reliable overclocking techniques for aggressive computer systems

    Computers have changed our lives beyond our own imagination in the past several decades. The continued and progressive advancements in VLSI technology and numerous micro-architectural innovations have played a key role in the design of spectacular low-cost high performance computing systems that have become omnipresent in today's technology driven world. Performance and dependability have become key concerns as these ubiquitous computing machines continue to drive our everyday life. Every application has unique demands, as they run in diverse operating environments. Dependable, aggressive and adaptive systems improve efficiency in terms of speed, reliability and energy consumption. Traditional computing systems run at a fixed clock frequency, which is determined by taking into account the worst-case timing paths, operating conditions, and process variations. Timing speculation based reliable overclocking advocates going beyond worst-case limits to achieve best performance while not avoiding, but detecting and correcting a modest number of timing errors. The success of this design methodology relies on the fact that timing critical paths are rarely exercised in a design, and typical execution happens much faster than the timing requirements dictated by worst-case design methodology. Better-than-worst-case design methodology is advocated by several recent research pursuits, which exploit dependability techniques to enhance computer system performance. In this dissertation, we address different aspects of timing speculation based adaptive reliable overclocking schemes, and evaluate their role in the design of low-cost, high performance, energy efficient and dependable systems. We visualize various control knobs in the design that can be favorably controlled to ensure different design targets.
As part of this research, we extend the SPRIT3E, or Superscalar PeRformance Improvement Through Tolerating Timing Errors, framework, and characterize the extent of application dependent performance acceleration achievable in superscalar processors by scrutinizing the various parameters that impact the operation beyond worst-case limits. We study the limitations imposed by short-path constraints on our technique, and present ways to exploit them to maximize performance gains. We analyze the sensitivity of our technique's adaptiveness by exploring the necessary hardware requirements for dynamic overclocking schemes. Experimental analysis based on SPEC2000 benchmarks running on a SimpleScalar Alpha processor simulator, augmented with error rate data obtained from hardware simulations of a superscalar processor, is presented. Even though reliable overclocking guarantees functional correctness, it leads to higher power consumption. As a consequence, reliable overclocking without considering on-chip temperatures will bring down the lifetime reliability of the chip. In this thesis, we analyze how reliable overclocking impacts the on-chip temperature of a microprocessor and evaluate the effects of overheating, due to such reliable dynamic frequency tuning mechanisms, on the lifetime reliability of these systems. We then evaluate the effect of performing thermal throttling, a technique that clamps the on-chip temperature below a predefined value, on system performance and reliability. Our study shows that a reliably overclocked system with dynamic thermal management achieves 25% performance improvement, while lasting for 14 years when operated within 353K. Over the past five decades, technology scaling, as predicted by Moore's law, has been the bedrock of semiconductor technology evolution. The continued downscaling of CMOS technology to deep sub-micron gate lengths has been the primary reason for its dominance in today's omnipresent silicon microchips.
Even as the transition to the next technology node is indispensable, the initial cost and time associated with doing so present a non-level playing field for the competitors in the semiconductor business. As part of this thesis, we evaluate the capability of speculative reliable overclocking mechanisms to maximize performance at a given technology level. We evaluate its competitiveness when compared to technology scaling, in terms of performance, power consumption, energy and energy delay product. We present a comprehensive comparison for integer and floating point SPEC2000 benchmarks running on a simulated Alpha processor at three different technology nodes in normal and enhanced modes. Our results suggest that adopting reliable overclocking strategies will help skip a technology node altogether, or be competitive in the market, while porting to the next technology node. Reliability has become a serious concern as systems embrace nanometer technologies. In this dissertation, we propose a novel fault tolerant aggressive system that combines soft error protection and timing error tolerance. We replicate both the pipeline registers and the pipeline stage combinational logic. The replicated logic receives its inputs from the primary pipeline registers while writing its output to the replicated pipeline registers. The organization of redundancy in the proposed Conjoined Pipeline system supports overclocking, provides concurrent error detection and recovery capability for soft errors, intermittent faults and timing errors, and flags permanent silicon defects. The fast recovery process requires no checkpointing and takes three cycles.
Back-annotated post-layout gate-level timing simulations, using 45nm technology, of a conjoined two-stage arithmetic pipeline and a conjoined five-stage DLX pipeline processor, with forwarding logic, show that our approach, even under a severe fault injection campaign, achieves near 100% fault coverage and an average performance improvement of about 20%, when dynamically overclocked.
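The trade-off behind timing speculation, a faster clock paid for with occasional error recovery, can be sketched with a simple first-order model. The model below is an assumption made for illustration and is not the evaluation method used in the dissertation: it treats each timing error as costing a fixed number of recovery cycles and ignores every other effect. The frequency ratio and error rate in the example are hypothetical inputs; only the three-cycle recovery penalty is taken from the abstract.

```python
def effective_speedup(freq_ratio, error_rate, recovery_cycles):
    """First-order net speedup of a reliably overclocked pipeline.

    freq_ratio      -- overclocked frequency / worst-case frequency
    error_rate      -- timing errors per useful cycle (assumed constant)
    recovery_cycles -- cycles lost to recovery per timing error
    """
    # Each useful cycle costs (1 + error_rate * recovery_cycles) cycles on
    # average, and every cycle is shorter by a factor of freq_ratio.
    return freq_ratio / (1.0 + error_rate * recovery_cycles)


def break_even_error_rate(freq_ratio, recovery_cycles):
    """Error rate at which overclocking stops paying off (speedup == 1)."""
    return (freq_ratio - 1.0) / recovery_cycles


# Hypothetical operating point: 30% overclock, 1% error rate,
# three-cycle recovery as quoted for the Conjoined Pipeline.
speedup = effective_speedup(1.3, 0.01, 3)
limit = break_even_error_rate(1.3, 3)
print(f"speedup = {speedup:.3f}, break-even error rate = {limit:.3f}")
```

In this simplified view, overclocking remains worthwhile only while the observed error rate stays below the break-even value, which is why the rarity of critical-path activation matters for the approach.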