10 research outputs found
An asynchronous instruction length decoder
Journal ArticleAbstract-This paper describes an investigation of potential advantages and pitfalls of applying an asynchronous design methodology to an advanced microprocessor architecture. A prototype complex instruction set length decoding and steering unit was implemented using self-timed circuits. [The Revolving Asynchronous Pentium® Processor Instruction Decoder (RAPPID) design implemented the complete Pentium II® 32-bit MMX instruction set.] The prototype chip was fabricated on a 0.25-CMOS process and tested successfully. Results show significant advantages-in particular, performance of 2.5-4.5 instructions per nanosecond-with manageable risks using this design technology. The prototype achieves three times the throughput and half the latency, dissipating only half the power and requiring about the same area as the fastest commercial 400-MHz clocked circuit fabricated on the same process
An asynchronous instruction length decoder
Journal ArticleThis paper describes an investigation of potential advantages and pitfalls of applying an asynchronous design methodology to an advanced microprocessor architecture. A prototype complex instruction set length decoding and steering unit was implemented using self-timed circuits. [The Revolving Asynchronous Pentium® Processor Instruction Decoder (RAPPID) design implemented the complete Pentium II® 32-bit MMX instruction set.] The prototype chip was fabricated on a 0.25-CMOS process and tested successfully. Results show significant advantages-in particular, performance of 2.5-4.5 instructions per nanosecond-with manageable risks using this design technology. The prototype achieves three times the throughput and half the latency, dissipating only half the power and requiring about the same area as the fastest commercial 400-MHz clocked circuit fabricated on the same process
Synthesis Of Self-resetting Stage Logic Pipelines
As designers began to pack multi-million transistors onto a single chip, their reliance on a global clocking signal to orchestrate the operations of the chip has started to face almost insurmountable difficulties. As a result, designers started to explore clockless circuits to avoid the global clocking problem. Recently, self-resetting circuits implemented in dynamic logic families have been proposed as viable clockless alternatives. While these circuits can produce excellent performances, they display serious limitations in terms of area cost and power consumption. A middle-of-the-road alternative, which can provide a good performance and avoid the limitations seen in dynamic self-resetting circuits, would be to implement self-resetting behavior in static circuits. This alternative has been introduced recently as Self-Resetting Stage Logic and used to propose three types of clockless pipelines. Experimental studies show that these pipelines have the potential to produce high throughputs with a minimum area overhead if a suitable synthesis methodology is available. This thesis proposes a novel synthesis methodology to design and verify clockless pipelines implemented in SRSL by taking advantage of the maturity of current CAD tools. This methodology formulates the synthesis problem as a combinatorial analytical problem for which a run-time efficient exact solution is difficult to derive. Consequently, a two-phase algorithm is proposed to synthesize these pipelines from gate netlists subject to user-specified constraints. The first phase is a heuristic based on the as-soon-as-possible scheduling strategy in which each gate of the netlist is assigned to a single pipeline stage without violating the period constraint of each pipeline stage. On the other hand, the second phase consists of a heuristic, based on the Kernighan-Lin partitioning strategy, to minimize the number of nets crossing each pair of adjacent pipeline stages. The objective of this optimization is to reduce the number of latches separating pipeline stages since these latches tend to occupy large areas. Experiments conducted on a prototype of the synthesis algorithm reveal that these self-resetting stage logic pipelines can easily reach throughputs higher than 1 GHz. Furthermore, these experiments reveal that the area overhead needed to implement the self-resetting circuitry of these pipelines can be easily amortized over the area of the logic embedded in the pipeline stages. In the overall, the synthesis methods developed for SRSL produce low area overhead pipelines for wide and deep gate netlists while it tends to produce high throughput pipelines for wide and shallow gate netlists. This shows that these pipelines are mostly suitable for coarse-grain datapaths
Design And Synthesis Of Clockless Pipelines Based On Self-resetting Stage Logic
For decades, digital design has been primarily dominated by clocked circuits. With larger scales of integration made possible by improved semiconductor manufacturing techniques, relying on a clock signal to orchestrate logic operations across an entire chip became increasingly difficult. Motivated by this problem, designers are currently considering circuits which can operate without a clock. However, the wide acceptance of these circuits by the digital design community requires two ingredients: (i) a unified design methodology supported by widely available CAD tools, and (ii) a granularity of design techniques suitable for synthesizing large designs. Currently, there is no unified established design methodology to support the design and verification of these circuits. Moreover, the majority of clockless design techniques is conceived at circuit level, and is subsequently so fine-grain, that their application to large designs can have unacceptable area costs. Given these considerations, this dissertation presents a new clockless technique, called self-resetting stage logic (SRSL), in which the computation of a block is reset periodically from within the block itself. SRSL is used as a building block for three coarse-grain pipelining techniques: (i) Stage-controlled self-resetting stage logic (S-SRSL) Pipelines: In these pipelines, the control of the communication between stages is performed locally between each pair of stages. This communication is performed in a uni-directional manner in order to simplify its implementation. (ii) Pipeline-controlled self-resetting stage logic (P-SRSL) Pipelines: In these pipelines, the communication between each pair of stages in the pipeline is driven by the oscillation of the last pipeline stage. Their communication scheme is identical to the one used in S-SRSL pipelines. (iii) Delay-tolerant self-resetting stage logic (D-SRSL) Pipelines: While communication in these pipelines is local in nature in a manner similar to the one used in S-SRL pipelines, this communication is nevertheless extended in both directions. The result of this bi-directional approach is an increase in the capability of the pipeline to handle stages with random delay. Based on these pipelining techniques, a new design methodology is proposed to synthesize clockless designs. The synthesis problem consists of synthesizing an SRSL pipeline from a gate netlist with a minimum area overhead given a specified data rate. A two-phase heuristic algorithm is proposed to solve this problem. The goal of the algorithm is to pipeline a given datapath by minimizing the area occupied by inter-stage latches without violating any timing constraints. Experiments with this synthesis algorithm show that while P-SRSL pipelines can reach high throughputs in shallow pipelines, D-SRSL pipelines can achieve comparable throughputs in deeper pipelines
Study on the maximum speed and reliability assurance in wave pipeline-based combinational circuits
Scope and Method of Study: Wave pipeline is one of the revolutionary technologies beyond conventional pipeline in the microprocessor architecture research area. Clockless wave pipeline is the cutting-edge and innovative pipeline without relying on clock signal. Due to the stringent requirement for high density and performance of current VLSI technology, reliability is being considered as one of the most crucial issues. Reliability modeling and optimization techniques have been applied extensively. Clock frequency is one of the keys to achieve the fast circuit speed. A clock cycle time optimization and analysis method is proposed in order to achieve a ultra high clock frequency in the context of the proposed new wave pipeline.Findings and Conclusions: The reliability-driven design and optimization techniques for the clockless wave pipeline are proposed. It is mainly focused on the two parts, i.e., request signal and datawave. A more aggressive technology beyond wave pipeline with ultra-high throughput and speed towards maximum circuit frequency is mainly proposed, analyzed, simulated, and verified
Realization and Formal Analysis of Asynchronous Pulse Communication Circuits
This work presents an approach to constructing asynchronous pulsed communication circuits. These circuits use small delay elements to introduce a gate level sense of time, removing the need for either a clock or handshaking signal to be part of a high-speed communication link. This construction method allows the creation of links with better than normal jitter tolerance, allowing for simple circuit architectures that can easily be made robust to radiation induced soft error. A 5Gbps radiation-hardened link, targeted at use in detector modules at the LHC, will be presented. This application presents a special challenge due to both very high radiation levels (1+MGy life time dose) and the demand for minimum resource (area, power, cable cost) use. The presented link, realized in 130nm technology, is unique in that it has low power (~50mW end to end) and very low area 0.12mm^2 including electrostatic discharge protection, and I/O amplifiers. Due to its asynchronous construction and the gate design style, the link has essentially zero power dissipation when idle, and enters and exits its idle state with no delay. In addition to the construction of the link, this presentation covers the design and analysis methodology that can be used to create other asynchronous communication circuits. The methodology achieves higher performance than conventional static technology but needs only a reasonable design effort using tools and strategies that are only mildly extended versions of those familiar to digital static designers. It is used to construct the serializer, deserializer, and self-test circuitry for the presented link. In this case, a 5Gbps SER/DES and a 2GHz parallel pseudo-random number generator are implemented in 130nm CMOS technology using a gate design style that does not dissipate static power
Recommended from our members
Design and performance optimization of asynchronous networks-on-chip
As digital systems continue to grow in complexity, the design of conventional synchronous systems is facing unprecedented challenges. The number of transistors on individual chips is already in the multi-billion range, and a greatly increasing number of components are being integrated onto a single chip. As a consequence, modern digital designs are under strong time-to-market pressure, and there is a critical need for composable design approaches for large complex systems.
In the past two decades, networks-on-chip (NoC’s) have been a highly active research area. In a NoC-based system, functional blocks are first designed individually and may run at different clock rates. These modules are then connected through a structured network for on-chip global communication. However, due to the rigidity of centrally-clocked NoC’s, there have been bottlenecks of system scalability, energy and performance, which cannot be easily solved with synchronous approaches. As a result, there has been significant recent interest in combing the notion of asynchrony with NoC designs. Since the NoC approach inherently separates the communication infrastructure, and its timing, from computational elements, it is a natural match for an asynchronous paradigm. Asynchronous NoC’s, therefore, enable a modular and extensible system composition for an ‘object-orient’ design style.
The thesis aims to significantly advance the state-of-art and viability of asynchronous and globally-asynchronous locally-synchronous (GALS) networks-on-chip, to enable high-performance and low-energy systems. The proposed asynchronous NoC’s are nearly entirely based on standard cells, which eases their integration into industrial design flows. The contributions are instantiated in three different directions.
First, practical acceleration techniques are proposed for optimizing the system latency, in order to break through the latency bottleneck in the memory interfaces of many on-chip parallel processors. Novel asynchronous network protocols are proposed, along with concrete NoC designs. A new concept, called ‘monitoring network’, is introduced. Monitoring networks are lightweight shadow networks used for fast-forwarding anticipated traffic information, ahead of the actual packet traffic. The routers are therefore allowed to initiate and perform arbitration and channel allocation in advance. The technique is successfully applied to two topologies which belong to two different categories – a variant mesh-of-trees (MoT) structure and a 2D-mesh topology. Considerable and stable latency improvements are observed across a wide range of traffic patterns, along with moderate throughput gains.
Second, for the first time, a high-performance and low-power asynchronous NoC router is compared directly to a leading commercial synchronous counterpart in an advanced industrial technology. The asynchronous router design shows significant performance improvements, as well as area and power savings. The proposed asynchronous router integrates several advanced techniques, including a low-latency circular FIFO for buffer design, and a novel end-to-end credit-based virtual channel (VC) flow control. In addition, a semi-automated design flow is created, which uses portions of a standard synchronous tool flow.
Finally, a high-performance multi-resource asynchronous arbiter design is developed. This small but important component can be directly used in existing asynchronous NoC’s for performance optimization. In addition, this standalone design promises use in opening up new NoC directions, as well as for general use in parallel systems. In the proposed arbiter design, the allocation of a resource to a client is divided into several steps. Multiple successive client-resource pairs can be selected rapidly in pipelined sequence, and the completion of the assignments can overlap in parallel.
In sum, the thesis provides a set of advanced design solutions for performance optimization of asynchronous and GALS networks-on-chip. These solutions are at different levels, from network protocols, down to router- and component-level optimizations, which can be directly applied to existing basic asynchronous NoC designs to provide a leap in performance improvement
Doctor of Philosophy
dissertationAsynchronous design has a very promising potential even though it has largely received a cold reception from industry. Part of this reluctance has been due to the necessity of custom design languages and computer aided design (CAD) flows to design, optimize, and validate asynchronous modules and systems. Next generation asynchronous flows should support modern programming languages (e.g., Verilog) and application specific integrated circuits (ASIC) CAD tools. They also have to support multifrequency designs with mixed synchronous (clocked) and asynchronous (unclocked) designs. This work presents a novel relative timing (RT) based methodology for generating multifrequency designs using synchronous CAD tools and flows. Synchronous CAD tools must be constrained for them to work with asynchronous circuits. Identification of these constraints and characterization flow to automatically derive the constraints is presented. The effect of the constraints on the designs and the way they are handled by the synchronous CAD tools are analyzed and reported in this work. The automation of the generation of asynchronous design templates and also the constraint generation is an important problem. Algorithms for automation of reset addition to asynchronous circuits and power and/or performance optimizations applied to the circuits using logical effort are explored thus filling an important hole in the automation flow. Constraints representing cyclic asynchronous circuits as directed acyclic graphs (DAGs) to the CAD tools is necessary for applying synchronous CAD optimizations like sizing, path delay optimizations and also using static timing analysis (STA) on these circuits. A thorough investigation for the requirements of cycle cutting while preserving timing paths is presented with an algorithm to automate the process of generating them. A large set of designs for 4 phase handshake protocol circuit implementations with early and late data validity are characterized for area, power and performance. Benchmark circuits with automated scripts to generate various configurations for better understanding of the designs are proposed and analyzed. Extension to the methodology like addition of scan insertion using automatic test pattern generation (ATPG) tools to add testability of datapath in bundled data asynchronous circuit implementations and timing closure approaches are also described. Energy, area, and performance of purely asynchronous circuits and circuits with mixed synchronous and asynchronous blocks are explored. Results indicate the benefits that can be derived by generating circuits with asynchronous components using this methodology
The 1991 3rd NASA Symposium on VLSI Design
Papers from the symposium are presented from the following sessions: (1) featured presentations 1; (2) very large scale integration (VLSI) circuit design; (3) VLSI architecture 1; (4) featured presentations 2; (5) neural networks; (6) VLSI architectures 2; (7) featured presentations 3; (8) verification 1; (9) analog design; (10) verification 2; (11) design innovations 1; (12) asynchronous design; and (13) design innovations 2
NASA Space Engineering Research Center Symposium on VLSI Design
The NASA Space Engineering Research Center (SERC) is proud to offer, at its second symposium on VLSI design, presentations by an outstanding set of individuals from national laboratories and the electronics industry. These featured speakers share insights into next generation advances that will serve as a basis for future VLSI design. Questions of reliability in the space environment along with new directions in CAD and design are addressed by the featured speakers