19 research outputs found

    Current Sensing Completion Detection in Single-Rail Asynchronous Systems

    In this article, an alternative approach to detecting the completion of computation in combinational blocks of asynchronous digital systems is presented. The proposed methodology is based on a well-known phenomenon of digital systems fabricated in CMOS technology: such logic circuits draw significantly more current during signal transitions than in the idle state. The duration of these current peaks correlates very well with the actual computation time of the combinational block, so this fact can be exploited to separate computation activity from the static state. The paper presents the fundamental background of this alternative completion detection and its implementation in single-rail encoded asynchronous systems, the proposed current-sensing circuitry, simulation results, and a comparison to state-of-the-art completion-detection methods. The presented method promises to enhance the performance of an asynchronous circuit and, under certain circumstances, also to reduce the silicon area requirements of the completion-detection block.
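The detection principle can be sketched behaviorally: monitor the supply current and declare completion once it has settled below a quiescent threshold for long enough. A minimal sketch in Python, where the trace, threshold, and hold time are illustrative values rather than figures from the article:

```python
def completion_time(current_trace, threshold, hold_samples):
    # Index at which the supply current has stayed below `threshold`
    # for `hold_samples` consecutive samples, i.e. where a
    # current-sensing completion detector would assert "done".
    quiet = 0
    for i, i_dd in enumerate(current_trace):
        quiet = quiet + 1 if i_dd < threshold else 0
        if quiet >= hold_samples:
            return i
    return None  # block never settled within the trace

# Sampled supply current: a switching burst followed by leakage-level idle.
trace = [0.10, 0.90, 1.20, 0.80, 0.30, 0.05, 0.04, 0.05, 0.04, 0.04]
```

With a 0.2 threshold and a three-sample hold, completion is flagged at sample 7, shortly after the burst dies out; the hold time is what keeps brief dips inside the burst from being mistaken for completion.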

    Self-timed field programmable gate array architectures


    Supply Voltage Control Considering Timing Errors in Synchronous Circuits

    Master's thesis (M.S.), Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2015. Advisor: Kiyoung Choi. Modern embedded systems are becoming increasingly constrained by power consumption. While we require these systems to compute ever more data at higher speed, lowering energy consumption is essential to preserve battery life as well as the integrity of devices. Among the many techniques for reducing chip power consumption, such as power gating and clock gating, lowering the supply voltage (possibly reducing the chip's frequency) is known to be the most effective. However, lowering the supply voltage too far, down to near the threshold voltage of the transistors, causes the logic delay to vary exponentially with intrinsic and extrinsic variations (process variations, temperature, aging, etc.) and thus forces the designer to set an increased timing margin. This thesis proposes a technique for automatically adjusting the supply voltage to match the speed of a logic block to a given timing constraint. Depending on process and temperature variations, our technique chooses the minimum supply voltage that satisfies the timing constraint defined by the designer. This allows the designer to reduce the default supply voltage of the logic block and thus save power.
In our experiments at the 28/32nm technology node, we reduced logic-block power by 52% on average by varying the supply voltage between 0.55V and 1V, against a nominal supply voltage of 1.05V.
Contents: 1. Introduction; 2. Background (Near-Threshold Computing; Current Sensing Completion Detection); 3. Proposed Approach; 4. Experimental Setup (Intrinsic Variations; Extrinsic Variations; Control Block; Logic Block; Experimental Parameters); 5. Experimental Results (results at the TT, FF, and SS corners; Effect of Temperature; Final Power Savings); 6. Conclusion and Future Work.
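The selection rule described in the abstract (pick the lowest supply voltage whose logic delay still meets the designer's timing constraint) can be sketched as follows. The delay model and candidate voltage grid are hypothetical stand-ins for on-chip delay measurement:

```python
def pick_min_vdd(delay_at, vdd_options, t_constraint):
    # Try candidate supply voltages from lowest to highest and return
    # the first one whose measured delay meets the timing constraint.
    for vdd in sorted(vdd_options):
        if delay_at(vdd) <= t_constraint:
            return vdd
    raise ValueError("no candidate voltage meets the timing constraint")

# Toy alpha-power-law-style delay model (illustrative only):
# delay grows sharply as Vdd approaches the threshold voltage Vth.
def toy_delay(vdd, vth=0.3, k=1.0):
    return k * vdd / (vdd - vth) ** 2
```

For example, with a constraint of 3.0 (arbitrary units) over candidates {0.55, 0.70, 0.85, 1.00} V, the toy model selects 0.85 V; a slower process corner (modeled here as a larger k) pushes the choice to a higher voltage, mirroring the corner-dependent behavior the thesis measures.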

    Asynchronous memory design.

    by Vincent Wing-Yun Sit. Thesis submitted June 1997. Thesis (M.Phil.), Chinese University of Hong Kong, 1998. Includes bibliographical references. Abstract also in Chinese.
Contents: 1. Introduction (asynchronous design: advantages, methodologies, system characteristics; asynchronous memory: motivation and definition; proposed memory design: control interface, overview, handshake control protocol); 2. Theory (variable bit-line load; current sensing completion detection with general and CMOS LSD current sensors; voltage sensing completion detection; multiple-delays completion generation); 3. Implementation (1M-bit SRAM framework; control circuit and read/write state transition graphs; bit-line segmentation and memory cell; current sensing, voltage sensing, and multiple-delays completion circuits for one-bit and eight-bit data buses); 4. Simulation (environment, memory timing specifications, and bit-line load determination; benchmark simulation; results for the three completion schemes); 5. Testing (test chip design: block diagram, schematic, layout; HSPICE post-layout simulation; logic and timing measurements); 6. Discussion (results and resource-consumption comparison of the three techniques; bit-line segmentation; applications; further developments: two-phase HCP interface, data bus expansion, speed optimization, modified write completion); 7. Conclusion; 8. References; 9. Appendix (HSPICE simulation parameters for typical, fast, and slow conditions; SRAM cell layout and netlist; test chip specifications, pin assignment, timing diagrams, schematics, layouts, and microphotograph).

    An ICT image processing chip based on fast computation algorithm and self-timed circuit technique.

    by Pang Tin-Chak Johnson. Thesis (M.Phil.), Chinese University of Hong Kong, 1997. Includes bibliographical references.
Contents: 1. Introduction (asynchronous systems: motivation, hazards, circuit classes; transform coding); 2. Asynchronous Design Methodologies (self-timed systems; DCVSL gates and handshake control; micropipelines: asymmetrical delays, variable delay and delay value selection; DCVSL vs. micropipeline comparison); 3. Self-timed Multipliers (bit-serial matrix multiplier in DCVSL and micropipeline styles; modified Booth's multiplier; three test chips); 4. Current-Sensing Completion Detection (current sensor: constant current source, current mirror, current comparator; self-timed logic using CSCD; test chips and results); 5. Self-timed ICT Processor Architecture (hardware and speed comparison of a general-purpose DSP and micropipeline designs with and without the fast algorithm); 6. Implementation of the Self-timed ICT Processor (first-version 2-D processor with self-timed transpose memory; final 1-D processor with fast algorithm: I/O buffers and control, handshake control unit, integer execution unit, program memory and instruction decoder; layout and specifications); 7. Testing (pin assignment; simulation; functional and transient tests; speed, power, and optimum delay control voltage; delay elements and logic cells); 8. Conclusions.

    Core interface optimization for multi-core neuromorphic processors

    Hardware implementations of Spiking Neural Networks (SNNs) represent a promising approach to edge computing for applications that require low power and low latency and cannot resort to external cloud-based computing services. However, most solutions proposed so far either support only relatively small networks or take up significant hardware resources to implement large ones. Realizing large-scale, scalable SNNs requires an efficient asynchronous communication and routing fabric that enables the design of multi-core architectures. In particular, the core interface that manages inter-core spike communication is a crucial component, as it is the Power-Performance-Area (PPA) bottleneck, especially in its arbitration architecture and routing memory. In this paper, we present an arbitration mechanism, with the corresponding asynchronous encoding pipeline circuits, based on hierarchical arbiter trees. The proposed scheme reduces latency by more than 70% in sparse-event mode compared to state-of-the-art arbitration architectures, at lower area cost. The routing memory uses an asynchronous Content Addressable Memory (CAM) with Current Sensing Completion Detection (CSCD), which saves approximately 46% energy and achieves a 40% increase in throughput over conventional asynchronous CAM using configurable delay lines, at the cost of only a slight increase in area. In addition, because they radically reduce the core-interface resources in multi-core neuromorphic processors, the proposed arbitration and CAM architectures can also be applied to a wide range of general asynchronous circuits and systems.
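The hierarchical arbiter-tree idea can be illustrated with a toy functional model: each tree node acts as a 2-input arbiter that forwards one pending request upward, so a single winner emerges after O(log N) stages. This sketch breaks ties toward the lower index, whereas a real mutex element resolves them nondeterministically:

```python
def arbitrate(requests):
    # Leaves carry the index of each active requester (None if idle).
    level = [i if r else None for i, r in enumerate(requests)]
    # Each round of pairing is one stage of 2-input arbiter nodes.
    while len(level) > 1:
        nxt = [a if a is not None else b
               for a, b in zip(level[::2], level[1::2])]
        if len(level) % 2:          # odd element passes through unpaired
            nxt.append(level[-1])
        level = nxt
    return level[0]                 # granted index, or None if all idle
```

In sparse-event mode most leaves are idle, so the lone request meets no competition at any node; the latency advantage reported in the paper comes from circuit-level pipelining of these stages, which this behavioral model does not capture.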

    Practical advances in asynchronous design

    Journal Article. Recent practical advances in asynchronous circuit and system design have resulted in renewed interest among circuit designers. Asynchronous systems are being viewed as an increasingly viable alternative to globally synchronous system organization. This tutorial presents the current state of the art in asynchronous circuit and system design in three areas. The first section details asynchronous control systems. The second describes a variety of approaches to asynchronous datapaths. The third covers asynchronous and self-timed circuits applied to the design of general-purpose processors.

    An integrated soft- and hard-programmable multithreaded architecture


    Dynamically reconfigurable asynchronous processor

    The main design requirements for today's mobile applications are high throughput, high energy efficiency, and high programmability. Until now, the choice of platform has often been limited to Application-Specific Integrated Circuits (ASICs), due to their best-of-breed performance and power consumption. The economies of scale possible in these high-volume markets have traditionally been able to hide the high Non-Recurring Engineering (NRE) costs required for designing and fabricating new ASICs. However, with NREs and design times escalating with each generation of mobile applications, this practice may be reaching its limit. Designers today are looking at programmable solutions so that they can respond more rapidly to changes in the market and spread costs over several generations of mobile applications. However, there have been few feasible alternatives to ASICs: Digital Signal Processors (DSPs) and microprocessors cannot meet the throughput requirements, whereas Field-Programmable Gate Arrays (FPGAs) require too much area and power. Coarse-grained dynamically reconfigurable architectures offer better solutions for high-throughput applications when power and area are taken into account. One promising example is the Reconfigurable Instruction Cell Array (RICA). RICA consists of an array of cells with an interconnect that can be dynamically reconfigured on every cycle. This allows quite complex datapaths to be rendered onto the fabric and executed in a single configuration, making these architectures particularly suitable for stream processing. Furthermore, RICA can be programmed from C, making it a good fit with existing design methodologies. However, the RICA architecture has a drawback: poor scalability in terms of area and power. As the core gets bigger, the number of sequential elements in the array must be increased significantly to maintain the ability to achieve high throughputs through pipelining. As a result, a larger clock tree is required to synchronise the increased number of sequential elements, and the clock tree takes up a larger percentage of the area and power consumption of the core. This thesis presents a novel Dynamically Reconfigurable Asynchronous Processor (DRAP), aimed at high-throughput mobile applications. DRAP is based on the RICA architecture but uses asynchronous design techniques: methods of designing digital systems without clocks. The absence of a global clock signal makes DRAP more scalable in terms of power and area overhead than its synchronous counterpart. The DRAP architecture maintains most of the benefits of custom asynchronous design while also providing programmability via conventional high-level languages. Results show that the DRAP processor delivers considerably lower power consumption than a market-leading Very Long Instruction Word (VLIW) processor and a low-power ARM processor. For example, DRAP reduced power consumption by 20 times compared to the ARM7 processor and 29 times compared to the TI C64x VLIW, when running the same benchmark capped to the same throughput on the same process technology (0.13μm). Compared to an equivalent RICA design, DRAP was up to 22% larger but reduced power by up to 1.9 times, and it achieved up to 2.8 times higher throughputs than RICA on the same benchmarks.
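The per-cycle configuration idea behind RICA can be pictured with a toy model: instead of issuing one instruction at a time, each configuration wires up a small dataflow graph of instruction cells that is evaluated as a unit. Everything here (the cell operations, the configuration format, the names) is illustrative, not the actual RICA netlist format:

```python
OPS = {"add": lambda a, b: a + b,
       "mul": lambda a, b: a * b,
       "sub": lambda a, b: a - b}

def run_config(config, inputs):
    # One "step": evaluate every instruction cell in the configured
    # dataflow order, as if the whole datapath fired in a single cycle.
    vals = dict(inputs)
    for dst, op, a, b in config:
        vals[dst] = OPS[op](vals[a], vals[b])
    return vals

# Datapath for (x + y) * x rendered as a single configuration.
step = [("t0", "add", "x", "y"),
        ("out", "mul", "t0", "x")]
```

A stream-processing kernel would reuse one such configuration across many input samples, which is why amortizing the configuration over a whole datapath, rather than per instruction, suits these workloads.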