
266 research outputs found

Area-efficient design of low-power asynchronous circuits (低電力非同期回路の面積高効率化設計)

Author: Xia Zhengfan
Publication date: 20/01/2015

Tohoku University, 亀山充隆 (Michitaka Kameyama), 課

Tohoku University Repository (TOUR) / 東北大学機関リポジトリ

CA-BIST for asynchronous circuits: a case study on the RAPPID asynchronous instruction length decoder

Author: Roncken Marly
Stevens Kenneth
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2000

Journal Article. This paper presents a case study in low-cost noninvasive Built-In Self Test (BIST) for RAPPID, a large-scale 120,000-transistor asynchronous version of the Pentium® Pro Instruction Length Decoder, which runs at 3.6 GHz. RAPPID uses a synchronous 0.25 micron CMOS library for static and domino logic, and has no Design-for-Test hooks other than some debug features. We explore the use of Cellular Automata (CA) for on-chip test pattern generation and response evaluation. More specifically, we look for fast ways to tune the CA-BIST to the RAPPID design, rather than using pseudo-random testing. The metric for tuning the CA-BIST pattern generation is based on an abstract hardware description model of the instruction length decoder, which is independent of implementation details, and hence also independent of the asynchronous circuit style. Our CA-BIST solution uses a novel bootstrap procedure for generating the test patterns, which give complete coverage for this metric, and cover 94% of the testable stuck-at faults for the actual design at switch level. Analysis of the undetected and untestable faults shows that the same fault effects can be expected for a similar clocked circuit. This is encouraging evidence that testability is no excuse to avoid asynchronous design techniques in addition to high-performance synchronous solutions.
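As a rough illustration of CA-based test pattern generation, the sketch below steps a one-dimensional cellular automaton and collects its successive states as test patterns. The hybrid rule 90/150 configuration, the null boundary, and the names `ca_step` and `ca_patterns` are assumptions chosen for illustration; the abstract does not say which CA rules or tuning procedure the authors used.

```python
def ca_step(state, rules):
    """Advance a 1-D null-boundary cellular automaton by one step.

    state: list of 0/1 cell values.
    rules: list with one entry per cell, either 90 or 150 (assumed setup).
      Rule 90:  next = left XOR right
      Rule 150: next = left XOR self XOR right
    """
    n = len(state)
    nxt = []
    for i in range(n):
        left = state[i - 1] if i > 0 else 0        # null boundary on the left
        right = state[i + 1] if i < n - 1 else 0   # null boundary on the right
        bit = left ^ right
        if rules[i] == 150:
            bit ^= state[i]
        nxt.append(bit)
    return nxt


def ca_patterns(seed, rules, count):
    """Return `count` successive CA states, starting with the seed itself."""
    patterns = []
    state = list(seed)
    for _ in range(count):
        patterns.append(state)
        state = ca_step(state, rules)
    return patterns


if __name__ == "__main__":
    for pattern in ca_patterns([1, 0, 0, 0], [90, 150, 90, 150], 5):
        print(pattern)
```

In hardware, each cell would be a flip-flop with a small XOR network, so the per-cell rule list here stands in for the wiring choice that a tuning procedure would adjust.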

The University of Utah: J. Willard Marriott Digital Library

CA-BIST for asynchronous circuits: a case study on the RAPPID asynchronous instruction length decoder

Author: Stevens Kenneth
Roncken Marly
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2000

The University of Utah: J. Willard Marriott Digital Library

opac.isi.ac.id

Indonesian Institute of the Art Yogyakarta

Asynchronous techniques for system-on-chip design

Author: Martin Alain J.
Nyström Mika
Publication date: 01/06/2006

SoC design will require asynchronous techniques as the large parameter variations across the chip will make it impossible to control delays in clock networks and other global signals efficiently. Initially, SoCs will be globally asynchronous and locally synchronous (GALS). But the complexity of the numerous asynchronous/synchronous interfaces required in a GALS will eventually lead to entirely asynchronous solutions. This paper introduces the main design principles, methods, and building blocks for asynchronous VLSI systems, with an emphasis on communication and synchronization. Asynchronous circuits with the only delay assumption of isochronic forks are called quasi-delay-insensitive (QDI). QDI is used in the paper as the basis for asynchronous logic. The paper discusses asynchronous handshake protocols for communication and the notions of validity/neutrality tests and completion trees. Basic building blocks for sequencing, storage, function evaluation, and buses are described, and two alternative methods for the implementation of an arbitrary computation are explained. Issues of arbitration and synchronization play an important role in complex distributed systems and especially in GALS. The two main asynchronous/synchronous interfaces needed in GALS (one based on a synchronizer, the other on a stoppable clock) are described and analyzed.
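The validity/neutrality tests and the completion tree mentioned above can be sketched behaviorally as follows. This is a minimal model, assuming a dual-rail encoding in which each bit is a (true rail, false rail) pair; the names `encode_dual_rail` and `CompletionDetector` are hypothetical, and the detector models the overall behavior of a tree of C-elements rather than its gate-level structure.

```python
NEUTRAL = (0, 0)  # both rails low: no data present


def encode_dual_rail(value, width):
    """Encode an integer as (true_rail, false_rail) pairs, least significant bit first."""
    word = []
    for i in range(width):
        bit = (value >> i) & 1
        word.append((bit, 1 - bit))
    return word


def is_valid(bit):
    """Validity test: exactly one rail is high."""
    return bit in ((1, 0), (0, 1))


def is_neutral(bit):
    """Neutrality test: both rails are low."""
    return bit == NEUTRAL


class CompletionDetector:
    """Behavioral stand-in for a completion tree built from C-elements.

    The output rises only when every bit is valid, falls only when every
    bit is neutral, and otherwise holds its previous value.
    """

    def __init__(self):
        self.done = 0

    def update(self, word):
        if all(is_valid(b) for b in word):
            self.done = 1
        elif all(is_neutral(b) for b in word):
            self.done = 0
        return self.done


if __name__ == "__main__":
    detector = CompletionDetector()
    print(detector.update([NEUTRAL] * 3))            # word still neutral
    print(detector.update(encode_dual_rail(5, 3)))   # word fully valid
    print(detector.update([NEUTRAL] * 3))            # word back to neutral
```

The hold behavior is what makes the detector usable in a four-phase handshake: the acknowledge changes only once the whole word has arrived or has fully returned to neutral.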

Caltech Authors

Low Power Processor Architectures and Contemporary Techniques for Power Optimization – A Review

Author: Gujarathi Hemal S
McDonald-Maier Klaus D
Qadri Muhammad Yasir
Publication venue: 'Academy Publisher'
Publication date: 01/01/2009

Technological evolution has significantly increased the number of transistors for a given die area and raised switching speeds from a few MHz to the GHz range. This decline in size and accompanying boost in performance demands lower supply voltages and effective power dissipation in chips with millions of transistors. This has triggered a substantial amount of research into power reduction techniques for almost every aspect of the chip, particularly the processor cores it contains. This paper presents an overview of techniques for achieving power efficiency, mainly at the processor core level but also in related domains such as buses and memories. Various processor parameters and features, such as supply voltage, clock frequency, cache and pipelining, can be optimized to reduce the power consumption of the processor. This paper discusses various ways in which these parameters can be optimized. Emerging power-efficient processor architectures are also overviewed, and research activities are discussed, which should help the reader identify how these factors in a processor contribute to power consumption. Some of these concepts are already established, whereas others are still active research areas. © 2009 ACADEMY PUBLISHER
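For context on why supply voltage and clock frequency are central levers, the standard first-order expression for CMOS dynamic (switching) power is given below. It is a textbook relation, not taken from the abstract, and the symbols are the usual ones: activity factor \alpha, switched capacitance C_L, supply voltage V_{DD}, and clock frequency f. The numeric example uses assumed voltages purely for illustration.

```latex
P_{\mathrm{dyn}} = \alpha \, C_L \, V_{DD}^{2} \, f
\qquad\Longrightarrow\qquad
\frac{P_{\mathrm{dyn}}(V_{DD} = 1.0\,\mathrm{V})}{P_{\mathrm{dyn}}(V_{DD} = 1.2\,\mathrm{V})}
= \left(\frac{1.0}{1.2}\right)^{2} \approx 0.69
```

Under these assumptions, lowering the supply from 1.2 V to 1.0 V at a fixed frequency and activity cuts dynamic power by roughly 31%, which is why voltage scaling has a larger effect than a proportional frequency reduction.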

University of Essex Research Repository

CiteSeerX

Crossref

DUAL-RAIL GATE STRUCTURE FOR A COMPLEX DATA PATH

Author: Mahesh Y
Nagasandeep Viswanadham
Publication venue: International Journal of Innovative Technology and Research
Publication date: 23/10/2016

This paper presents a high-throughput, ultralow-power asynchronous domino logic pipeline design method that targets latch-free, very fine-grain (gate-level) design. The data paths comprise a combination of dual-rail and single-rail domino gates. The use of dual-rail domino gates is restricted to constructing a stable critical data path. Based on this critical data path, the handshake circuits are greatly simplified, which gives the pipeline high throughput as well as low power consumption. The 4-phase bundled-data protocol design most closely resembles the style of synchronous circuits. Furthermore, the stable critical data path enables the adoption of single-rail domino gates in the noncritical data paths, which saves considerable power by reducing the overhead of logic circuits. An 8 × 8 array-style multiplier is used to evaluate the proposed pipeline method. Compared with a bundled-data asynchronous domino logic pipeline, the proposed pipeline saves energy in both the best case and the worst case when processing different data patterns.

International Journal of Innovative Technology and Research (IJITR)

Doctor of Philosophy

Author: Das Shomit
Publication venue: University of Utah
Publication date: 01/01/2017

Dissertation. Communication surpasses computation as the power and performance bottleneck in forthcoming exascale processors. Scaling has made transistors cheap, but on-chip wires have grown more expensive, both in terms of latency as well as energy. Therefore, the need for low energy, high performance interconnects is highly pronounced, especially for long distance communication. In this work, we examine two aspects of the global signaling problem. The first part of the thesis focuses on a high bandwidth asynchronous signaling protocol for long distance communication. Asynchrony among intellectual property (IP) cores on a chip has become necessary in a System on Chip (SoC) environment. Traditional asynchronous handshaking protocol suffers from loss of throughput due to the added latency of sending the acknowledge signal back to the sender. We demonstrate a method that supports end-to-end communication across links with arbitrarily large latency, without limiting the bandwidth, so long as line variation can be reliably controlled. We also evaluate the energy and latency improvements as a result of the design choices made available by this protocol. The use of transmission lines as a physical interconnect medium shows promise for deep submicron technologies. In our evaluations, we notice a lower energy footprint, as well as vastly reduced wire latency for transmission line interconnects. We approach this problem from two sides. Using field solvers, we investigate the physical design choices to determine the optimal way to implement these lines for a given back-end-of-line (BEOL) stack. We also approach the problem from a system designer's viewpoint, looking at ways to optimize the lines for different performance targets. This work analyzes the advantages and pitfalls of implementing asynchronous channel protocols for communication over long distances.
Finally, the innovations resulting from this work are applied to a network-on-chip design example and the resulting power-performance benefits are reported.

The University of Utah: J. Willard Marriott Digital Library

VLSI implementation of discrete cosine transform using a new asynchronous pipelined architecture.

Author: Lee Chi-wai
Publication date: 01/01/2002

Lee Chi-wai. Thesis (M.Phil.), Chinese University of Hong Kong, 2002. Includes bibliographical references (leaves 191-196). Abstracts in English and Chinese.

Table of contents:

Front matter: Abstract; Abstract in Chinese (摘要); Acknowledgements; Table of Contents; List of Tables; List of Figures

Chapter 1: Introduction
  1.1 Synchronous Design
  1.2 Asynchronous Design
  1.3 Discrete Cosine Transform
  1.4 Motivation
  1.5 Organization of the Thesis

Chapter 2: Asynchronous Design Methodology
  2.1 Overview
  2.2 Background
  2.3 Past Designs
  2.4 Micropipeline
  2.5 New Asynchronous Architecture

Chapter 3: DCT/IDCT Processor Design Methodology
  3.1 Overview
  3.2 Hardware Architecture
  3.3 DCT Algorithm
  3.4 Used Architecture and DCT Algorithm
    3.4.1 Implementation on Programmable DSP Processor
    3.4.2 Implementation on Dedicated Processor

Chapter 4: New Techniques for Operating Dynamic Logic in Low Frequency
  4.1 Overview
  4.2 Background
  4.3 Traditional Technique
  4.4 New Technique - Refresh Control Circuit
    4.4.1 Principle
    4.4.2 Voltage Sensor
    4.4.3 Ring Oscillator
    4.4.4 Counter, Latch and Comparator
    4.4.5 Recalibrate Circuit
    4.4.6 Operation Monitoring Circuit
    4.4.7 Overall Circuit

Chapter 5: DCT Implementation on Programmable DSP Processor
  5.1 Overview
  5.2 Processor Architecture
    5.2.1 Arithmetic Unit
    5.2.2 Switching Network
    5.2.3 FIFO Memory
    5.2.4 Instruction Memory
  5.3 Programming
  5.4 DCT Implementation

Chapter 6: DCT Implementation on Dedicated DCT Processor
  6.1 Overview
  6.2 DCT Chip Architecture
    6.2.1 1D DCT Core
      6.2.1.1 Core Architecture
      6.2.1.2 Flow of Operation
      6.2.1.3 Data Replicator
      6.2.1.4 DCT Coefficients Memory
    6.2.2 Combination of IDCT to 1D DCT core
    6.2.3 Accuracy
  6.3 Transpose Memory
    6.3.1 Architecture
    6.3.2 Address Generator
    6.3.3 RAM Block

Chapter 7: Results and Discussions
  7.1 Overview
  7.2 Refresh Control Circuit
    7.2.1 Implementation Results and Performance
    7.2.2 Discussion
  7.3 Programmable DSP Processor
    7.3.1 Implementation Results and Performance
    7.3.2 Discussion
  7.4 1D DCT/IDCT Core
    7.4.1 Simulation Results
    7.4.2 Measurement Results
    7.4.3 Discussion
  7.5 Transpose Memory
    7.5.1 Simulated Results
    7.5.2 Measurement Results
    7.5.3 Discussion

Chapter 8: Conclusions

Appendix
  Operations of switches in DCT implementation of programmable DSP processor
  C Program for evaluating the error in DCT/IDCT core
  Pin Assignments of the Programmable DSP Processor Chip
  Pin Assignments of the 1D DCT/IDCT Core Chip
  Pin Assignments of the Transpose Memory Chip
  Chip Microphotograph of the 1D DCT/IDCT Core
  Chip Microphotograph of the Transpose Memory
  Measured Waveforms of 1D DCT/IDCT Chip
  Measured Waveforms of Transpose Memory Chip
  Schematics of Refresh Control Circuit
  Schematics of Programmable DSP Processor
  Schematics of 1D DCT/IDCT Core
  Schematics of Transpose Memory

References

Design Libraries - CD-ROM

CUHK Digital Repository

Null convention logic circuits for asynchronous computer architecture

Author: Kim M
Publication venue: RMIT University

For most of its history, computer architecture has been able to benefit from a rapid scaling in semiconductor technology, resulting in continuous improvements to CPU design. During that period, synchronous logic has dominated because of its inherent ease of design and abundant tools. However, with the scaling of semiconductor processes into deep sub-micron and then to nano-scale dimensions, computer architecture is hitting a number of roadblocks such as high power and increased process variability. Asynchronous techniques can potentially offer many advantages compared to conventional synchronous design, including average case vs. worst case performance, robustness in the face of process and operating point variability and the ready availability of high performance, fine grained pipeline architectures. Of the many alternative approaches to asynchronous design, Null Convention Logic (NCL) has the advantage that its quasi delay-insensitive behavior makes it relatively easy to set up complex circuits without the need for exhaustive timing analysis. This thesis examines the characteristics of an NCL based asynchronous RISC-V CPU and analyses the problems with applying NCL to CPU design. While a number of university and industry groups have previously developed small 8-bit microprocessor architectures using NCL techniques, it is still unclear whether these offer any real advantages over conventional synchronous design. A key objective of this work has been to analyse the impact of larger word widths and more complex architectures on NCL CPU implementations. The research commenced by re-evaluating existing techniques for implementing NCL on programmable devices such as FPGAs. The little work that has been undertaken previously on FPGA implementations of asynchronous logic has been inconclusive and seems to indicate that asynchronous systems cannot be easily implemented in these devices.
However, most of this work related to an alternative technique called bundled data, which is not well suited to FPGA implementation because of the difficulty in controlling and matching delays in a 'bundle' of signals. On the other hand, this thesis clearly shows that such applications are not only possible with NCL, but there are some distinct advantages in being able to prototype complex asynchronous systems in a field-programmable technology such as the FPGA. A large part of the value of NCL derives from its architectural level behavior, inherent pipelining, and optimization opportunities such as the merging of register and combinational logic functions. In this work, a number of NCL multiplier architectures have been analyzed to reveal the performance trade-offs between various non-pipelined, 1D and 2D organizations. Two-dimensional pipelining can easily be applied to regular architectures such as array multipliers in a way that is both high performance and area-efficient. It was found that the performance of 2D pipelining for small networks such as multipliers is around 260% faster than the equivalent non-pipelined design. However, the design uses 265% more transistors so the methodology is mainly of benefit where performance is strongly favored over area. A pipelined 32-bit × 32-bit signed Baugh-Wooley multiplier with Wallace-Tree Carry Save Adders (CSA), which is representative of a real design used for CPUs and DSPs, was used to further explore this concept as it is faster and has fewer pipeline stages compared to the normal array multiplier using Ripple-Carry adders (RCA). It was found that 1D pipelining with ripple-carry chains is an efficient implementation option but becomes less so for larger multipliers, due to the completion logic for which the delay time depends largely on the number of bits involved in the completion network.
The average-case performance of ripple-carry adders was explored using random input vectors and it was observed that it offers little advantage on the smaller multiplier blocks, but this particular timing characteristic of asynchronous design styles becomes increasingly more important as word size grows. Finally, this research has resulted in the development of the first 32-Bit asynchronous RISC-V CPU core. Called the Redback RISC, the architecture is a structure of pipeline rings composed of computational oscillations linked with flow completeness relationships. It has been written using NELL, a commercial description/synthesis tool that outputs standard Verilog. The Redback has been analysed and compared to two approximately equivalent industry standard 32-Bit synchronous RISC-V cores (PicoRV32 and Rocket) that are already fabricated and used in industry. While the NCL implementation is larger than both commercial cores it has similar performance and lower power compared to the PicoRV32. The implementation results were also compared against an existing NCL design tool flow (UNCLE), which showed how much the results of these implementation strategies differ. The Redback RISC has achieved a similar level of throughput and 43% better power and 34% better energy compared to one of the synchronous cores with the same benchmark test and test conditions, such as input supply voltage. However, it was shown that area is the biggest drawback for NCL CPU design. The core is roughly 2.5× larger than synchronous designs. On the other hand its area is still 2.9× smaller than previous designs using UNCLE tools. The area penalty is largely due to the unavoidable translation into a dual-rail topology when using the standard NCL cell library.
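The basic building block of NCL is the threshold gate with hysteresis, often written THmn: the output asserts once at least m of its n inputs are asserted and then stays asserted until all inputs have returned to zero. The sketch below is a behavioral model of that rule only; the class name `ThresholdGate` is hypothetical and it is not taken from the thesis, NELL, or UNCLE.

```python
class ThresholdGate:
    """Behavioral model of an NCL THmn threshold gate with hysteresis."""

    def __init__(self, m, n):
        if not 1 <= m <= n:
            raise ValueError("threshold m must satisfy 1 <= m <= n")
        self.m = m
        self.n = n
        self.out = 0  # gate starts in the NULL (deasserted) state

    def update(self, inputs):
        """Apply one set of 0/1 input values and return the gate output."""
        if len(inputs) != self.n:
            raise ValueError("expected %d inputs" % self.n)
        asserted = sum(inputs)
        if asserted >= self.m:
            self.out = 1      # set: threshold reached
        elif asserted == 0:
            self.out = 0      # reset: all inputs back to NULL
        # otherwise hold the previous output (hysteresis)
        return self.out


if __name__ == "__main__":
    th23 = ThresholdGate(2, 3)
    for vector in [(1, 0, 0), (1, 1, 0), (1, 0, 0), (0, 0, 0)]:
        print(vector, th23.update(vector))
```

A TH22 gate in this model behaves like a two-input C-element, which is why NCL circuits can observe complete DATA and NULL wavefronts without separate timing assumptions.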

RMIT Research Repository


Design and performance optimization of asynchronous networks-on-chip

Author: Jiang Weiwei
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2018

As digital systems continue to grow in complexity, the design of conventional synchronous systems is facing unprecedented challenges. The number of transistors on individual chips is already in the multi-billion range, and a greatly increasing number of components are being integrated onto a single chip. As a consequence, modern digital designs are under strong time-to-market pressure, and there is a critical need for composable design approaches for large complex systems. In the past two decades, networks-on-chip (NoC’s) have been a highly active research area. In a NoC-based system, functional blocks are first designed individually and may run at different clock rates. These modules are then connected through a structured network for on-chip global communication. However, due to the rigidity of centrally-clocked NoC’s, there have been bottlenecks of system scalability, energy and performance, which cannot be easily solved with synchronous approaches. As a result, there has been significant recent interest in combining the notion of asynchrony with NoC designs. Since the NoC approach inherently separates the communication infrastructure, and its timing, from computational elements, it is a natural match for an asynchronous paradigm. Asynchronous NoC’s, therefore, enable a modular and extensible system composition for an ‘object-oriented’ design style. The thesis aims to significantly advance the state of the art and viability of asynchronous and globally-asynchronous locally-synchronous (GALS) networks-on-chip, to enable high-performance and low-energy systems. The proposed asynchronous NoC’s are nearly entirely based on standard cells, which eases their integration into industrial design flows. The contributions are instantiated in three different directions. First, practical acceleration techniques are proposed for optimizing the system latency, in order to break through the latency bottleneck in the memory interfaces of many on-chip parallel processors.
Novel asynchronous network protocols are proposed, along with concrete NoC designs. A new concept, called ‘monitoring network’, is introduced. Monitoring networks are lightweight shadow networks used for fast-forwarding anticipated traffic information, ahead of the actual packet traffic. The routers are therefore allowed to initiate and perform arbitration and channel allocation in advance. The technique is successfully applied to two topologies which belong to two different categories – a variant mesh-of-trees (MoT) structure and a 2D-mesh topology. Considerable and stable latency improvements are observed across a wide range of traffic patterns, along with moderate throughput gains. Second, for the first time, a high-performance and low-power asynchronous NoC router is compared directly to a leading commercial synchronous counterpart in an advanced industrial technology. The asynchronous router design shows significant performance improvements, as well as area and power savings. The proposed asynchronous router integrates several advanced techniques, including a low-latency circular FIFO for buffer design, and a novel end-to-end credit-based virtual channel (VC) flow control. In addition, a semi-automated design flow is created, which uses portions of a standard synchronous tool flow. Finally, a high-performance multi-resource asynchronous arbiter design is developed. This small but important component can be directly used in existing asynchronous NoC’s for performance optimization. In addition, this standalone design promises use in opening up new NoC directions, as well as for general use in parallel systems. In the proposed arbiter design, the allocation of a resource to a client is divided into several steps. Multiple successive client-resource pairs can be selected rapidly in pipelined sequence, and the completion of the assignments can overlap in parallel. 
In sum, the thesis provides a set of advanced design solutions for performance optimization of asynchronous and GALS networks-on-chip. These solutions are at different levels, from network protocols, down to router- and component-level optimizations, which can be directly applied to existing basic asynchronous NoC designs to provide a leap in performance improvement.
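To make the idea of credit-based flow control concrete, here is a minimal generic sketch: the sender holds one credit per free buffer slot at the receiver, spends a credit for each flit sent, and regains it when the receiver drains a slot. This is the textbook scheme under simplifying assumptions (a single channel, zero-latency credit return), not the end-to-end virtual-channel design proposed in the thesis; the name `CreditLink` is hypothetical.

```python
from collections import deque


class CreditLink:
    """Single-channel credit-based flow control between a sender and a receiver buffer."""

    def __init__(self, depth):
        self.credits = depth    # sender-side count of free receiver slots
        self.buffer = deque()   # receiver-side FIFO

    def send(self, flit):
        """Try to send a flit; return False if no credit is available."""
        if self.credits == 0:
            return False        # sender must stall: receiver buffer is full
        self.credits -= 1
        self.buffer.append(flit)
        return True

    def receive(self):
        """Drain one flit from the receiver buffer and return its credit."""
        if not self.buffer:
            return None
        flit = self.buffer.popleft()
        self.credits += 1       # credit returned to the sender (assumed instantaneous)
        return flit


if __name__ == "__main__":
    link = CreditLink(depth=2)
    print(link.send("a"), link.send("b"), link.send("c"))
    print(link.receive())
    print(link.send("c"))
```

Because the sender never transmits without a credit, the receiver buffer cannot overflow, and no per-flit acknowledge round trip is needed on the forward path.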

Columbia University Academic Commons