31 research outputs found
Asynchronous techniques for system-on-chip design
SoC design will require asynchronous techniques as the large parameter variations across the chip will make it impossible to control delays in clock networks and other global signals efficiently. Initially, SoCs will be globally asynchronous and locally synchronous (GALS). But the complexity of the numerous asynchronous/synchronous interfaces required in a GALS will eventually lead to entirely asynchronous solutions. This paper introduces the main design principles, methods, and building blocks for asynchronous VLSI systems, with an emphasis on communication and synchronization. Asynchronous circuits with the only delay assumption of isochronic forks are called quasi-delay-insensitive (QDI). QDI is used in the paper as the basis for asynchronous logic. The paper discusses asynchronous handshake protocols for communication and the notion of validity/neutrality tests, and completion tree. Basic building blocks for sequencing, storage, function evaluation, and buses are described, and two alternative methods for the implementation of an arbitrary computation are explained. Issues of arbitration, and synchronization play an important role in complex distributed systems and especially in GALS. The two main asynchronous/synchronous interfaces needed in GALS-one based on synchronizer, the other on stoppable clock-are described and analyzed
Autonomous Probabilistic Coprocessing with Petaflips per Second
In this paper we present a concrete design for a probabilistic (p-) computer
based on a network of p-bits, robust classical entities fluctuating between -1
and +1, with probabilities that are controlled through an input constructed
from the outputs of other p-bits. The architecture of this probabilistic
computer is similar to a stochastic neural network with the p-bit playing the
role of a binary stochastic neuron, but with one key difference: there is no
sequencer used to enforce an ordering of p-bit updates, as is typically
required. Instead, we explore \textit{sequencerless} designs where all p-bits
are allowed to flip autonomously and demonstrate that such designs can allow
ultrafast operation unconstrained by available clock speeds without
compromising the solution's fidelity. Based on experimental results from a
hardware benchmark of the autonomous design and benchmarked device models, we
project that a nanomagnetic implementation can scale to achieve petaflips per
second with millions of neurons. A key contribution of this paper is the focus
on a hardware metric flips per second as a problem and
substrate-independent figure-of-merit for an emerging class of hardware
annealers known as Ising Machines. Much like the shrinking feature sizes of
transistors that have continually driven Moore's Law, we believe that flips per
second can be continually improved in later technology generations of a wide
class of probabilistic, domain specific hardware.Comment: 13 pages, 8 figures, 1 tabl
Design And Synthesis Of Clockless Pipelines Based On Self-resetting Stage Logic
For decades, digital design has been primarily dominated by clocked circuits. With larger scales of integration made possible by improved semiconductor manufacturing techniques, relying on a clock signal to orchestrate logic operations across an entire chip became increasingly difficult. Motivated by this problem, designers are currently considering circuits which can operate without a clock. However, the wide acceptance of these circuits by the digital design community requires two ingredients: (i) a unified design methodology supported by widely available CAD tools, and (ii) a granularity of design techniques suitable for synthesizing large designs. Currently, there is no unified established design methodology to support the design and verification of these circuits. Moreover, the majority of clockless design techniques is conceived at circuit level, and is subsequently so fine-grain, that their application to large designs can have unacceptable area costs. Given these considerations, this dissertation presents a new clockless technique, called self-resetting stage logic (SRSL), in which the computation of a block is reset periodically from within the block itself. SRSL is used as a building block for three coarse-grain pipelining techniques: (i) Stage-controlled self-resetting stage logic (S-SRSL) Pipelines: In these pipelines, the control of the communication between stages is performed locally between each pair of stages. This communication is performed in a uni-directional manner in order to simplify its implementation. (ii) Pipeline-controlled self-resetting stage logic (P-SRSL) Pipelines: In these pipelines, the communication between each pair of stages in the pipeline is driven by the oscillation of the last pipeline stage. Their communication scheme is identical to the one used in S-SRSL pipelines. (iii) Delay-tolerant self-resetting stage logic (D-SRSL) Pipelines: While communication in these pipelines is local in nature in a manner similar to the one used in S-SRL pipelines, this communication is nevertheless extended in both directions. The result of this bi-directional approach is an increase in the capability of the pipeline to handle stages with random delay. Based on these pipelining techniques, a new design methodology is proposed to synthesize clockless designs. The synthesis problem consists of synthesizing an SRSL pipeline from a gate netlist with a minimum area overhead given a specified data rate. A two-phase heuristic algorithm is proposed to solve this problem. The goal of the algorithm is to pipeline a given datapath by minimizing the area occupied by inter-stage latches without violating any timing constraints. Experiments with this synthesis algorithm show that while P-SRSL pipelines can reach high throughputs in shallow pipelines, D-SRSL pipelines can achieve comparable throughputs in deeper pipelines
Scalable Emulation of Sign-ProblemFree Hamiltonians with Room Temperature p-bits
The growing field of quantum computing is based on the concept of a q-bit
which is a delicate superposition of 0 and 1, requiring cryogenic temperatures
for its physical realization along with challenging coherent coupling
techniques for entangling them. By contrast, a probabilistic bit or a p-bit is
a robust classical entity that fluctuates between 0 and 1, and can be
implemented at room temperature using present-day technology. Here, we show
that a probabilistic coprocessor built out of room temperature p-bits can be
used to accelerate simulations of a special class of quantum many-body systems
that are sign-problemfree or stoquastic, leveraging the well-known
Suzuki-Trotter decomposition that maps a -dimensional quantum many body
Hamiltonian to a +1-dimensional classical Hamiltonian. This mapping allows
an efficient emulation of a quantum system by classical computers and is
commonly used in software to perform Quantum Monte Carlo (QMC) algorithms. By
contrast, we show that a compact, embedded MTJ-based coprocessor can serve as a
highly efficient hardware-accelerator for such QMC algorithms providing several
orders of magnitude improvement in speed compared to optimized CPU
implementations. Using realistic device-level SPICE simulations we demonstrate
that the correct quantum correlations can be obtained using a classical
p-circuit built with existing technology and operating at room temperature. The
proposed coprocessor can serve as a tool to study stoquastic quantum many-body
systems, overcoming challenges associated with physical quantum annealers.Comment: Fixed minor typos and expanded Appendi
Recommended from our members
Investigation of Energy-Efficient Hybrid Analog/Digital Approximate Computation in Continuous Time
This work investigates energy-efficient approximate computation for solving differential equations. It extends the analog computing techniques to a new paradigm: continuous-time hybrid computation, where both analog and digital circuits operate in continuous time. In this approach, the time intervals in the digital signals contain important information. Unlike conventional synchronous digital circuits, continuous-time digital signals offer the benefits of adaptive power dissipation and no quantization noise.
Two prototype chips have been fabricated in 65 nm CMOS technology and tested successfully. The first chip is capable of solving nonlinear differential equations up to 4th order, and the second chip scales up to 16th order based on the first chip. Nonlinear functions are generated by a programmable, clockless, continuous-time 8-bit hybrid architecture (ADC+SRAM+DAC). Digitally-assisted calibration is used in all analog/mixed-signal blocks. Compared to the prior art, our chips makes possible arbitrary nonlinearities and achieves 16 times lower power dissipation, thanks to technology scaling and extensive use of class-AB analog blocks.
Typically, the unit achieves a computational accuracy of about 0.5% to 5% RMS, solution times from a fraction of 1 micro second to several hundred micro seconds, and total computational energy from a fraction of 1 nJ to hundreds of nJ, depending on equation details. Very significant advantages are observed in computational speed and energy (over two orders of magnitude and over one order of magnitude, respectively) compared to those obtained with a modern MSP430 microcontroller for the same RMS error
Resource-Constrained Acquisition Circuits for Next Generation Neural Interfaces
The development of neural interfaces allowing the acquisition of signals from the cortex of the brain has seen an increasing amount of interest both in academic research as well as in the commercial space due to their ability to aid people with various medical conditions, such as spinal cord injuries, as well as their potential to allow more seamless interactions between people and machines. While it has already been demonstrated that neural implants can allow tetraplegic patients to control robotic arms, thus to an extent returning some motoric function, the current state of the art often involves the use of heavy table-top instruments connected by wires passing through the patient’s skull, thus making the applications impractical and chronically infeasible.
Those limitations are leading to the development of the next generation of neural interfaces that will overcome those issues by being minimal in size and completely wireless, thus paving a way to the possibility of their chronic application. Their development however faces several challenges in numerous aspects of engineering due to constraints presented by their minimal size, amount of power available as well as the materials that can be utilised.
The aim of this work is to explore some of those challenges and investigate novel circuit techniques that would allow the implementation of acquisition analogue front-ends under the presented constraints. This is facilitated by first giving an overview of the problematic of recording electrodes and their electrical characterisation in terms of their impedance profile and added noise that can be used to guide the design of analogue front-ends.
Continuous time (CT) acquisition is then investigated as a promising signal digitisation technique alternative to more conventional methods in terms of its suitability. This is complemented by a description of practical implementations of a CT analogue-to-digital converter (ADC) including a novel technique of clockless stochastic chopping aimed at the suppression of flicker noise that commonly affects the acquisition of low-frequency signals. A compact design is presented, implementing a 450 nW, 5.5 bit ENOB CT ADC, occupying an area of 0.0288 mm2 in a 0.18 μm CMOS technology, making this the smallest presented design in literature to the best of our knowledge.
As completely wireless neural implants rely on power delivered through wireless links, their supply voltage is often subject to large high frequency variations as well voltage uncertainty making it necessary to design reference circuits and voltage regulators providing stable reference voltage and supply in the constrained space afforded to them. This results in numerous challenges that are explored and a design of a practical implementation of a reference circuit and voltage regulator is presented. Two designs in a 0.35 μm CMOS technology are presented, showing respectively a measured PSRR of ≈60 dB and ≈53 dB at DC and a worst-case PSRR of ≈42 dB and ≈33 dB with a less than 1% standard deviation in the output reference voltage of 1.2 V while consuming a power of ≈7 μW.
Finally, ΣΔ modulators are investigated for their suitability in neural signal acquisition chains, their properties explained and a practical implementation of a ΣΔ DC-coupled neural acquisition circuit presented. This implements a 10-kHz, 40 dB SNDR ΣΔ analogue front-end implemented in a 0.18 μm CMOS technology occupying a compact area of 0.044 μm2 per channel while consuming 31.1 μW per channel.Open Acces
Recommended from our members
Hybrid Analog-Digital Co-Processing for Scientific Computation
In the past 10 years computer architecture research has moved to more heterogeneity and less adherence to conventional abstractions. Scientists and engineers hold an unshakable belief that computing holds keys to unlocking humanity's Grand Challenges. Acting on that belief they have looked deeper into computer architecture to find specialized support for their applications. Likewise, computer architects have looked deeper into circuits and devices in search of untapped performance and efficiency. The lines between computer architecture layers---applications, algorithms, architectures, microarchitectures, circuits and devices---have blurred. Against this backdrop, a menagerie of computer architectures are on the horizon, ones that forgo basic assumptions about computer hardware, and require new thinking of how such hardware supports problems and algorithms.
This thesis is about revisiting hybrid analog-digital computing in support of diverse modern workloads. Hybrid computing had extensive applications in early computing history, and has been revisited for small-scale applications in embedded systems. But architectural support for using hybrid computing in modern workloads, at scale and with high accuracy solutions, has been lacking.
I demonstrate solving a variety of scientific computing problems, including stochastic ODEs, partial differential equations, linear algebra, and nonlinear systems of equations, as case studies in hybrid computing. I solve these problems on a system of multiple prototype analog accelerator chips built by a team at Columbia University. On that team I made contributions toward programming the chips, building the digital interface, and validating the chips' functionality. The analog accelerator chip is intended for use in conjunction with a conventional digital host computer.
The appeal and motivation for using an analog accelerator is efficiency and performance, but it comes with limitations in accuracy and problem sizes that we have to work around.
The first problem is how to do problems in this unconventional computation model. Scientific computing phrases problems as differential equations and algebraic equations. Differential equations are a continuous view of the world, while algebraic equations are a discrete one. Prior work in analog computing mostly focused on differential equations; algebraic equations played a minor role in prior work in analog computing. The secret to using the analog accelerator to support modern workloads on conventional computers is that these two viewpoints are interchangeable. The algebraic equations that underlie most workloads can be solved as differential equations,
and differential equations are naturally solvable in the analog accelerator chip. A hybrid analog-digital computer architecture can focus on solving linear and nonlinear algebra problems to support many workloads.
The second problem is how to get accurate solutions using hybrid analog-digital computing. The reason that the analog computation model gives less accurate solutions is it gives up representing numbers as digital binary numbers, and instead uses the full range of analog voltage and current to represent real numbers. Prior work has established that encoding data in analog signals gives an energy efficiency advantage as long as the analog data precision is limited. While the analog accelerator alone may be useful for energy-constrained applications where inputs and outputs are imprecise, we are more interested in using analog in conjunction with digital for precise solutions. This thesis gives novel insight that the trick to do so is to solve nonlinear problems where low-precision guesses are useful for conventional digital algorithms.
The third problem is how to solve large problems using hybrid analog-digital computing. The reason the analog computation model can't handle large problems is it gives up step-by-step discrete-time operation, instead allowing variables to evolve smoothly in continuous time. To make that happen the analog accelerator works by chaining hardware for mathematical operations end-to-end. During computation analog data flows through the hardware with no overheads in control logic and memory accesses. The downside is then the needed hardware size grows alongside problem sizes. While scientific computing researchers have for a long time split large problems into smaller subproblems to fit in digital computer constraints, this thesis is a first attempt to consider these divide-and-conquer algorithms as an essential tool in using the analog model of computation.
As we enter the post-Moore’s law era of computing, unconventional architectures will offer specialized models of computation that uniquely support specific problem types. Two prominent examples are deep neural networks and quantum computers. Recent trends in computer science research show these unconventional architectures will soon have broad adoption. In this thesis I show another specialized, unconventional architecture is to use analog accelerators to solve problems in scientific computing. Computer architecture researchers will discover other important models of computation in the future. This thesis is an example of the discovery process, implementation, and evaluation of how an unconventional architecture supports specialized workloads
Dynamic Assertion-Based Verification for SystemC
SystemC has emerged as a de facto standard modeling language for hardware and embedded systems. However, the current standard does not provide support for temporal specifications. Specifically, SystemC lacks a mechanism for sampling the state of the model at different types of temporal resolutions, for observing the internal state of modules, and for integrating monitors efficiently into the model's execution. This work presents a novel framework for specifying and efficiently monitoring temporal assertions of SystemC models that removes these restrictions. This work introduces new specification language primitives that (1) expose the inner state of the SystemC kernel in a principled way, (2) allow for very fine control over the temporal resolution, and (3) allow sampling at arbitrary locations in the user code. An efficient modular monitoring framework presented here allows the integration of monitors into the execution of the model, while at the same time incurring low overhead and allowing for easy adoption. Instrumentation of the user code is automated using Aspect-Oriented Programming techniques, thereby allowing the integration of user-code-level sample points into the monitoring framework. While most related approaches optimize the size of the monitors, this work focuses on minimizing the runtime overhead of the monitors. Different encoding configurations are identified and evaluated empirically using monitors synthesized from a large benchmark of random and pattern temporal specifications. The framework and approaches described in this dissertation allow the adoption of assertion-based verification for SystemC models written using various levels of abstraction, from system level to register-transfer level. An advantage of this work is that many existing specification languages call be adopted to use the specification primitives described here, and the framework can easily be integrated into existing implementations of SystemC
2-wire time independent asynchronous communications
Communications both to and between low end microprocessors represents a real cost in a number of industrial and consumer products. This thesis starts by examining the properties of protocols that help to minimize these expenses and comes to the conclusion that the derived set of properties define a new category of communications protocol : Time Independent Asynchronous ( TIA) communications. To show the utility of the TIA category we develop a novel TIA protocol that uses only 2-wires and general IO pins on each host. The protocol is analyzed using the Petri net based STG ( Signal Transition Graph) which is widely use to model asynchronous logic. It is shown that STGs do not accurately model the behavior of software driven systems and so a modified form called STG-FT ( STG For Threads) is developed to better model software systems. A simulator is created to take an STG-FT model and perform a full reachability tree analysis to prove correctness and analyze livelock and deadlock properties. The simulator can also examine the full reachability tree for every possible system state ( the cross product of all sub-system states), and analyze deadlock and livelock issues related to unexpected inputs and unusual situations. Reachability pruning algorithms are developed which decrease the search tree by a factor of approximately 250 million. The 2-wire protocol is implemented between a PC and an Atmel Tiny26 microprocessor, there is also a variant that works between microprocessors. Testing verifies the simulation results including an avoidable livelock condition with data throughput peaking at a useful 50 kilobits/second in both directions. The first practical application of 2-wire TIA is part of a novel debugger for the Atmel Tiny26 microprocessor. The approach can be extended to any microprocessor with general IO pins. TIA communications, developed in this thesis, is a serious contender whenever low end microprocessors must communicate with other processors. Consumer and industrial products may be able to achieve cost saving by using this new protocol