Search CORE

417 research outputs found

High Peformance and Low Power On-Die Interconnect Fabrics.

Author: Satpathy Sudhir Kumar
Publication venue
Publication date
Field of study

Increasing power density with technology scaling has caused stagnation in operating frequency of modern day microprocessors. This has led designers to prefer multicore architectures over complex monolithic processors to keep up with the demand for rising computing throughput. Although processing units are getting smaller and simpler, the dramatic rise of their count on a single die has made the fabric that connects these processing units increasingly complex. These interconnect fabrics have become a bottleneck in improving overall system effciency. As a result, the design paradigm for multi-core chips is gradually shifting from a core-centric architecture towards an interconnect-centric architecture, where system efficiency is limited by the fabric rather than the processing ability of any individual core. This dissertation introduces three novel and synergistic circuit techniques to improve scalability of switch fabrics to make on-die integration of hundreds to thousands of cores feasible. 1) A matrix topology is proposed for designing a fully connected switch fabric that re-uses output buses for programming, and stores shue congurations at cross points. This significantly reduces routing congestion, lowers area/power, and improves per- formance. Silicon measurements demonstrate 47% energy savings in a 64-lane SIMD processor fabricated in 65nm CMOS over a conventional implementation. 2) A novel approach to handle high radix arbitration along with data routing is proposed. It optimally uses existing cross-bar interconnect resources without requiring any additional overhead. Bandwidth exceeding 2Tb/s is recorded in a test prototype fabricated in 65nm. 3) Building on the later, a new circuit topology to manage and update priority adaptively within the switch fabric without incurring additional delay or area is then proposed. Several assist circuit techniques, such as a thyristor based sense amplifier and self regenerating bi-directional repeaters are proposed for high speed energy efficient signaling to and from the switch fabric to improve overall routing efficiency. Using these techniques a 64 x 64 switch fabric with 128b data bus fabricated in 45nm achieves a throughput of 4.5Tb/s at single cycle latency while operating at 559MHz.Ph.D.Electrical EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/91506/1/sudhirks_1.pd

Deep Blue Documents at the University of Michigan

Synthesis Of Self-resetting Stage Logic Pipelines

Author: Oreifej Rashad
Publication venue: 'Information Bulletin on Variable Stars (IBVS)'
Publication date: 01/01/2006
Field of study

As designers began to pack multi-million transistors onto a single chip, their reliance on a global clocking signal to orchestrate the operations of the chip has started to face almost insurmountable difficulties. As a result, designers started to explore clockless circuits to avoid the global clocking problem. Recently, self-resetting circuits implemented in dynamic logic families have been proposed as viable clockless alternatives. While these circuits can produce excellent performances, they display serious limitations in terms of area cost and power consumption. A middle-of-the-road alternative, which can provide a good performance and avoid the limitations seen in dynamic self-resetting circuits, would be to implement self-resetting behavior in static circuits. This alternative has been introduced recently as Self-Resetting Stage Logic and used to propose three types of clockless pipelines. Experimental studies show that these pipelines have the potential to produce high throughputs with a minimum area overhead if a suitable synthesis methodology is available. This thesis proposes a novel synthesis methodology to design and verify clockless pipelines implemented in SRSL by taking advantage of the maturity of current CAD tools. This methodology formulates the synthesis problem as a combinatorial analytical problem for which a run-time efficient exact solution is difficult to derive. Consequently, a two-phase algorithm is proposed to synthesize these pipelines from gate netlists subject to user-specified constraints. The first phase is a heuristic based on the as-soon-as-possible scheduling strategy in which each gate of the netlist is assigned to a single pipeline stage without violating the period constraint of each pipeline stage. On the other hand, the second phase consists of a heuristic, based on the Kernighan-Lin partitioning strategy, to minimize the number of nets crossing each pair of adjacent pipeline stages. The objective of this optimization is to reduce the number of latches separating pipeline stages since these latches tend to occupy large areas. Experiments conducted on a prototype of the synthesis algorithm reveal that these self-resetting stage logic pipelines can easily reach throughputs higher than 1 GHz. Furthermore, these experiments reveal that the area overhead needed to implement the self-resetting circuitry of these pipelines can be easily amortized over the area of the logic embedded in the pipeline stages. In the overall, the synthesis methods developed for SRSL produce low area overhead pipelines for wide and deep gate netlists while it tends to produce high throughput pipelines for wide and shallow gate netlists. This shows that these pipelines are mostly suitable for coarse-grain datapaths

University of Central Florida (UCF): STARS (Showcase of Text, Archives, Research & Scholarship)

Reliable Low-Latency and Low-Complexity Viterbi Architectures Benchmarked on ASIC and FPGA

Author: Singh Vineeta Pannalal
Publication venue: RIT Scholar Works
Publication date: 01/08/2015
Field of study

The Viterbi algorithm is commonly applied in a number of sensitive usage models including decoding convolutional codes used in communications such as satellite communication, cellular relay, and wireless local area networks. Moreover, the algorithm has been applied to automatic speech recognition and storage devices. In this thesis, efficient error detection schemes for architectures based on low-latency, low-complexity Viterbi decoders are presented. The merit of the proposed schemes is that reliability requirements, overhead tolerance, and performance degradation limits are embedded in the structures and can be adapted accordingly. We also present three variants of recomputing with encoded operands and its modifications to detect both transient and permanent faults, coupled with signature-based schemes. The instrumented decoder architecture has been subjected to extensive error detection assessments through simulations, and application-specific integrated circuit (ASIC) [32nm library] and field-programmable gate array (FPGA) [Xilinx Virtex-6 family] implementations for benchmark. The proposed fine-grained approaches can be utilized based on reliability objectives and performance/implementation metrics degradation tolerance

RIT Scholar Works

One way Doppler extractor. Volume 1: Vernier technique

Author: Blasco R. W.
Klein S.
Nossen E. J.
Starner E. R.
Yanosov J. A.
Publication venue
Publication date
Field of study

A feasibility analysis, trade-offs, and implementation for a One Way Doppler Extraction system are discussed. A Doppler error analysis shows that quantization error is a primary source of Doppler measurement error. Several competing extraction techniques are compared and a Vernier technique is developed which obtains high Doppler resolution with low speed logic. Parameter trade-offs and sensitivities for the Vernier technique are analyzed, leading to a hardware design configuration. A detailed design, operation, and performance evaluation of the resulting breadboard model is presented which verifies the theoretical performance predictions. Performance tests have verified that the breadboard is capable of extracting Doppler, on an S-band signal, to an accuracy of less than 0.02 Hertz for a one second averaging period. This corresponds to a range rate error of no more than 3 millimeters per second

NASA Technical Reports Server

VLSI analogs of neuronal visual processing: a synthesis of form and function

Author: Mahowald Misha
Publication venue: 'California Institute of Technology Library'
Publication date: 01/01/1992
Field of study

This thesis describes the development and testing of a simple visual system fabricated using complementary metal-oxide-semiconductor (CMOS) very large scale integration (VLSI) technology. This visual system is composed of three subsystems. A silicon retina, fabricated on a single chip, transduces light and performs signal processing in a manner similar to a simple vertebrate retina. A stereocorrespondence chip uses bilateral retinal input to estimate the location of objects in depth. A silicon optic nerve allows communication between chips by a method that preserves the idiom of action potential transmission in the nervous system. Each of these subsystems illuminates various aspects of the relationship between VLSI analogs and their neurobiological counterparts. The overall synthetic visual system demonstrates that analog VLSI can capture a significant portion of the function of neural structures at a systems level, and concomitantly, that incorporating neural architectures leads to new engineering approaches to computation in VLSI. The relationship between neural systems and VLSI is rooted in the shared limitations imposed by computing in similar physical media. The systems discussed in this text support the belief that the physical limitations imposed by the computational medium significantly affect the evolving algorithm. Since circuits are essentially physical structures, I advocate the use of analog VLSI as powerful medium of abstraction, suitable for understanding and expressing the function of real neural systems. The working chip elevates the circuit description to a kind of synthetic formalism. The behaving physical circuit provides a formal test of theories of function that can be expressed in the language of circuits

Caltech Authors

CERN Document Server

Caltech Theses and Dissertations

VLSI architectures for public key cryptology

Author: Tomlinson Allan
Publication venue: The University of Edinburgh
Publication date: 01/01/1991
Field of study

Edinburgh Research Archive

Recommended from our members

Contributions to the Design of Asynchronous Macromodular Systems

Author: Plana Luis A.
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/1999
Field of study

In this thesis, I advocate the use of macromodules to design and build robust and performance-competitive asynchronous systems. The contributions of the work relate to different aspects of the design of asynchronous macromodular systems. First, an architectural optimization for 4-phase systems is introduced. The goal of the optimization is to increase the performance of a system by increasing the level of concurrent activity in the sequencing of data processing stages. In particular, three new asynchronous sequencers are designed, which increase the throughput of the system. Existing asynchronous data paths do not operate correctly at this increased level of concurrency: data hazards may result. Interlock mechanisms are introduced to insure correct operation. The technique can also be regarded as a low-power optimization: The increased throughput can be traded for a significant reduction in the power consumption of the entire system. SPICE simulation results show that the new sequencers allow roughly twice the throughput of non-concurrent sequencers. The simulations also show that, after voltage scaling, energy dissipation is reduced by a factor of 2.5. Second, the use of pulses for efficient inter-module synchronization is introduced. The idea is complemented with the definition of a pulse-mode handshake protocol and the characterization of Pulse-Burst Operation (PBO), an important extension to traditional pulse-mode operation. Also, a basic set of macromodules, that efficiently implement control operations such as sequencing, selection, iteration, concurrency control, resource sharing, and arbitration is presented. Modules for interfacing pulse-mode circuits with traditional 2-phase and 4-phase circuits are also included in the set. Finally, the design of a packet switch is used to demonstrate the viability of pulse-mode macromodules to implement complex, high performance systems. The switch organization, its asynchronous operation, and the low control overhead introduced by pulse-mode macromodules result in a design that can handle 2.4 times the target throughput of 155 Mbits/Sec. Also, the switch is characterized by very low input-to-output latency. These results suggest that pulse-mode macromodules can keep control overhead low without introducing complex, unsafe timing considerations, two necessary conditions to achieve robust, performance-competitive systems

Columbia University Academic Commons

Design of an asynchronous processor

Author: Sotiriou Christos Panagiotis
Publication venue: The University of Edinburgh
Publication date: 01/01/2001
Field of study

Edinburgh Research Archive

The Fifth NASA Symposium on VLSI Design

Author
Publication venue
Publication date
Field of study

The fifth annual NASA Symposium on VLSI Design had 13 sessions including Radiation Effects, Architectures, Mixed Signal, Design Techniques, Fault Testing, Synthesis, Signal Processing, and other Featured Presentations. The symposium provides insights into developments in VLSI and digital systems which can be used to increase data systems performance. The presentations share insights into next generation advances that will serve as a basis for future VLSI design

NASA Technical Reports Server