Search CORE

4 research outputs found

Recommended from our members

Design and performance optimization of asynchronous networks-on-chip

Author: Jiang Weiwei
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2018
Field of study

As digital systems continue to grow in complexity, the design of conventional synchronous systems is facing unprecedented challenges. The number of transistors on individual chips is already in the multi-billion range, and a greatly increasing number of components are being integrated onto a single chip. As a consequence, modern digital designs are under strong time-to-market pressure, and there is a critical need for composable design approaches for large complex systems. In the past two decades, networks-on-chip (NoC’s) have been a highly active research area. In a NoC-based system, functional blocks are first designed individually and may run at different clock rates. These modules are then connected through a structured network for on-chip global communication. However, due to the rigidity of centrally-clocked NoC’s, there have been bottlenecks of system scalability, energy and performance, which cannot be easily solved with synchronous approaches. As a result, there has been significant recent interest in combing the notion of asynchrony with NoC designs. Since the NoC approach inherently separates the communication infrastructure, and its timing, from computational elements, it is a natural match for an asynchronous paradigm. Asynchronous NoC’s, therefore, enable a modular and extensible system composition for an ‘object-orient’ design style. The thesis aims to significantly advance the state-of-art and viability of asynchronous and globally-asynchronous locally-synchronous (GALS) networks-on-chip, to enable high-performance and low-energy systems. The proposed asynchronous NoC’s are nearly entirely based on standard cells, which eases their integration into industrial design flows. The contributions are instantiated in three different directions. First, practical acceleration techniques are proposed for optimizing the system latency, in order to break through the latency bottleneck in the memory interfaces of many on-chip parallel processors. Novel asynchronous network protocols are proposed, along with concrete NoC designs. A new concept, called ‘monitoring network’, is introduced. Monitoring networks are lightweight shadow networks used for fast-forwarding anticipated traffic information, ahead of the actual packet traffic. The routers are therefore allowed to initiate and perform arbitration and channel allocation in advance. The technique is successfully applied to two topologies which belong to two different categories – a variant mesh-of-trees (MoT) structure and a 2D-mesh topology. Considerable and stable latency improvements are observed across a wide range of traffic patterns, along with moderate throughput gains. Second, for the first time, a high-performance and low-power asynchronous NoC router is compared directly to a leading commercial synchronous counterpart in an advanced industrial technology. The asynchronous router design shows significant performance improvements, as well as area and power savings. The proposed asynchronous router integrates several advanced techniques, including a low-latency circular FIFO for buffer design, and a novel end-to-end credit-based virtual channel (VC) flow control. In addition, a semi-automated design flow is created, which uses portions of a standard synchronous tool flow. Finally, a high-performance multi-resource asynchronous arbiter design is developed. This small but important component can be directly used in existing asynchronous NoC’s for performance optimization. In addition, this standalone design promises use in opening up new NoC directions, as well as for general use in parallel systems. In the proposed arbiter design, the allocation of a resource to a client is divided into several steps. Multiple successive client-resource pairs can be selected rapidly in pipelined sequence, and the completion of the assignments can overlap in parallel. In sum, the thesis provides a set of advanced design solutions for performance optimization of asynchronous and GALS networks-on-chip. These solutions are at different levels, from network protocols, down to router- and component-level optimizations, which can be directly applied to existing basic asynchronous NoC designs to provide a leap in performance improvement

Columbia University Academic Commons

High sample-rate Givens rotations for recursive least squares

Author: Walke Richard Lewis
Publication venue
Publication date
Field of study

The design of an application-specific integrated circuit of a parallel array processor is considered for recursive least squares by QR decomposition using Givens rotations, applicable in adaptive filtering and beamforming applications. Emphasis is on high sample-rate operation, which, for this recursive algorithm, means that the time to perform arithmetic operations is critical. The algorithm, architecture and arithmetic are considered in a single integrated design procedure to achieve optimum results. A realisation approach using standard arithmetic operators, add, multiply and divide is adopted. The design of high-throughput operators with low delay is addressed for fixed- and floating-point number formats, and the application of redundant arithmetic considered. New redundant multiplier architectures are presented enabling reductions in area of up to 25%, whilst maintaining low delay. A technique is presented enabling the use of a conventional tree multiplier in recursive applications, allowing savings in area and delay. Two new divider architectures are presented showing benefits compared with the radix-2 modified SRT algorithm. Givens rotation algorithms are examined to determine their suitability for VLSI implementation. A novel algorithm, based on the Squared Givens Rotation (SGR) algorithm, is developed enabling the sample-rate to be increased by a factor of approximately 6 and offering area reductions up to a factor of 2 over previous approaches. An estimated sample-rate of 136 MHz could be achieved using a standard cell approach and O.35pm CMOS technology. The enhanced SGR algorithm has been compared with a CORDIC approach and shown to benefit by a factor of 3 in area and over 11 in sample-rate. When compared with a recent implementation on a parallel array of general purpose (GP) DSP chips, it is estimated that a single application specific chip could offer up to 1,500 times the computation obtained from a single OP DSP chip

Warwick Research Archives Portal Repository

An ICT image processing chip based on fast computation algorithm and self-timed circuit technique.

Author
Publication venue
Publication date: 01/01/1997
Field of study

by Johnson, Tin-Chak Pang.Thesis (M.Phil.)--Chinese University of Hong Kong, 1997.Includes bibliographical references.AcknowledgmentsAbstractList of figuresList of tablesChapter 1. --- Introduction --- p.1-1Chapter 1.1 --- Introduction --- p.1-1Chapter 1.2 --- Introduction to asynchronous system --- p.1-5Chapter 1.2.1 --- Motivation --- p.1-5Chapter 1.2.2 --- Hazards --- p.1-7Chapter 1.2.3 --- Classes of Asynchronous circuits --- p.1-8Chapter 1.3 --- Introduction to Transform Coding --- p.1-9Chapter 1.4 --- Organization of the Thesis --- p.1-16Chapter 2. --- Asynchronous Design Methodologies --- p.2-1Chapter 2.1 --- Introduction --- p.2-1Chapter 2.2 --- Self-timed system --- p.2-2Chapter 2.3 --- DCVSL Methodology --- p.2-4Chapter 2.3.1 --- DCVSL gate --- p.2-5Chapter 2.3.2 --- Handshake Control --- p.2-7Chapter 2.4 --- Micropipeline Methodology --- p.2-11Chapter 2.4.1 --- Summary of previous design --- p.2-12Chapter 2.4.2 --- New Micropipeline structure and improvements --- p.2-17Chapter 2.4.2.1 --- Asymmetrical delay --- p.2-20Chapter 2.4.2.2 --- Variable Delay and Delay Value Selection --- p.2-22Chapter 2.5 --- Comparison between DCVSL and Micropipeline --- p.2-25Chapter 3. --- Self-timed Multipliers --- p.3-1Chapter 3.1 --- Introduction --- p.3-1Chapter 3.2 --- Design Example 1 : Bit-serial matrix multiplier --- p.3-3Chapter 3.2.1 --- DCVSL design --- p.3-4Chapter 3.2.2 --- Micropipeline design --- p.3-4Chapter 3.2.3 --- The first test chip --- p.3-5Chapter 3.2.4 --- Second test chip --- p.3-7Chapter 3.3 --- Design Example 2 - Modified Booth's Multiplier --- p.3-9Chapter 3.3.1 --- Circuit Design --- p.3-10Chapter 3.3.2 --- Simulation result --- p.3-12Chapter 3.3.3 --- The third test chip --- p.3-14Chapter 4. --- Current-Sensing Completion Detection --- p.4-1Chapter 4.1 --- Introduction --- p.4-1Chapter 4.2 --- Current-sensor --- p.4-2Chapter 4.2.1 --- Constant current source --- p.4-2Chapter 4.2.2 --- Current mirror --- p.4-4Chapter 4.2.3 --- Current comparator --- p.4-5Chapter 4.3 --- Self-timed logic using CSCD --- p.4-9Chapter 4.4 --- CSCD test chips and testing results --- p.4-10Chapter 4.4.1 --- Test result --- p.4-11Chapter 5. --- Self-timed ICT processor architecture --- p.5-1Chapter 5.1 --- Introduction --- p.5-1Chapter 5.2 --- Comparison of different architecture --- p.5-3Chapter 5.2.1 --- General purpose Digital Signal Processor --- p.5-5Chapter 5.2.1.1 --- Hardware and speed estimation : --- p.5-6Chapter 5.2.2 --- Micropipeline without fast algorithm --- p.5-7Chapter 5.2.2.1 --- Hardware and speed estimation : --- p.5-8Chapter 5.2.3 --- Micropipeline with fast algorithm (I) --- p.5-8Chapter 5.2.3.1 --- Hardware and speed estimation : --- p.5-9Chapter 5.2.4 --- Micropipeline with fast algorithm (II) --- p.5-10Chapter 5.2.4.1 --- Hardware and speed estimation : --- p.5-11Chapter 6. --- Implementation of self-timed ICT processor --- p.6-1Chapter 6.1 --- Introduction --- p.6-1Chapter 6.2 --- Implementation of Self-timed 2-D ICT processor (First version) --- p.6-3Chapter 6.2.1 --- 1-D ICT module --- p.6-4Chapter 6.2.2 --- Self-timed Transpose memory --- p.6-5Chapter 6.2.3 --- Layout Design --- p.6-8Chapter 6.3 --- Implementation of Self-timed 1-D ICT processor with fast algorithm (final version) --- p.6-9Chapter 6.3.1 --- I/O buffers and control units --- p.6-10Chapter 6.3.1.1 --- Input control --- p.6-11Chapter 6.3.1.2 --- Output control --- p.6-12Chapter 6.3.1.2.1 --- Self-timed Computational Block --- p.6-13Chapter 6.3.1.3 --- Handshake Control Unit --- p.6-14Chapter 6.3.1.4 --- Integer Execution Unit (IEU) --- p.6-18Chapter 6.3.1.5 --- Program memory and Instruction decoder --- p.6-20Chapter 6.3.2 --- Layout Design --- p.6-21Chapter 6.4 --- Specifications of the final version self-timed ICT chip --- p.6-22Chapter 7. --- Testing of Self-timed ICT processor --- p.7-1Chapter 7.1 --- Introduction --- p.7-1Chapter 7.2 --- Pin assignment of Self-timed 1 -D ICT chip --- p.7-2Chapter 7.3 --- Simulation --- p.7-3Chapter 7.4 --- Testing of Self-timed 1-D ICT processor --- p.7-5Chapter 7.4.1 --- Functional test --- p.7-5Chapter 7.4.1.1 --- Testing environment and results --- p.7-5Chapter 7.4.2 --- Transient Characteristics --- p.7-7Chapter 7.4.3 --- Comments on speed and power --- p.7-10Chapter 7.4.4 --- Determination of optimum delay control voltage --- p.7-12Chapter 7.5 --- Testing of delay element and other logic cells --- p.7-13Chapter 8. --- Conclusions --- p.8-1BibliographyAppendice

CUHK Digital Repository

Bibliography of Lewis Research Center technical publications announced in 1977

Author
Publication venue
Publication date
Field of study

This compilation of abstracts describes and indexes over 780 technical reports resulting from the scientific and engineering work performed and managed by the Lewis Research Center in 1977. All the publications were announced in the 1977 issues of STAR (Scientific and Technical Aerospace Reports) and/or IAA (International Aerospace Abstracts). Documents cited include research reports, journal articles, conference presentations, patents and patent applications, and theses

NASA Technical Reports Server