
    New Views for Stochastic Computing: From Time-Encoding to Deterministic Processing

    University of Minnesota Ph.D. dissertation. July 2018. Major: Electrical/Computer Engineering. Advisor: David Lilja. 1 computer file (PDF); xi, 149 pages.
    Stochastic computing (SC), a paradigm first introduced in the 1960s, has received considerable attention in recent years as a potential paradigm for emerging technologies and "post-CMOS" computing. Logical computation is performed on random bitstreams in which the signal value is encoded by the probability of obtaining a one versus a zero. This unconventional representation of data offers some intriguing advantages over conventional weighted binary. Implementing complex functions with simple hardware (e.g., multiplication using a single AND gate), tolerating soft errors (i.e., bit flips), and progressive precision are the primary advantages of SC. The obvious disadvantage, however, is latency: a stochastic representation is exponentially longer than conventional binary radix. Long latencies translate into high energy consumption, often higher than that of a conventional binary counterpart. Generating bitstreams is also costly. Factoring in the cost of the bitstream generators, the overall hardware cost of an SC implementation is often comparable to a conventional binary implementation. This dissertation begins by proposing a highly unorthodox idea: performing computation with digital constructs on time-encoded analog signals. We introduce a new, energy-efficient, high-performance, and much less costly approach to SC using time-encoded pulse signals. We explore the design and implementation of arithmetic operations on time-encoded data and discuss the advantages, challenges, and potential applications. Experimental results on image processing applications show up to 99% performance speedup, 98% savings in energy dissipation, and 40% area reduction compared to prior stochastic implementations. We further introduce a low-cost approach for synthesizing sorting network circuits based on deterministic unary bitstreams. Synthesis results show more than 90% area and power savings compared to the costs of the conventional binary implementation. Time-based encoding of data is then exploited for fast and energy-efficient processing of data with the developed sorting circuits. Poor progressive precision is the main challenge with the recently developed deterministic methods of SC. We propose a high-quality down-sampling method which significantly improves the processing time and the energy consumption of these deterministic methods by pseudo-randomizing bitstreams. We also propose two novel deterministic methods of processing bitstreams by using low-discrepancy sequences. We further introduce a new advantage of the SC paradigm: the skew tolerance of SC circuits. We exploit this advantage in developing polysynchronous clocking, a design strategy for optimizing the clock distribution network of SC systems. Finally, as the first study of its kind to the best of our knowledge, we rethink the memory system design for SC. We propose a seamless stochastic system, StochMem, which features analog memory to trade the energy and area overhead of data conversion for computation accuracy.
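    As an illustration of the AND-gate multiplication the abstract mentions, here is a minimal Python sketch of unipolar stochastic multiplication on random bitstreams. The helper names (to_bitstream, sc_multiply, value) and the stream length are our own illustrative choices, not from the dissertation.

```python
import random

def to_bitstream(p, n, rng):
    """Encode probability p as a random bitstream of length n (unipolar SC)."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def sc_multiply(xs, ys):
    """Multiply two stochastic streams with a bitwise AND:
    P(x AND y) = P(x) * P(y) when the streams are uncorrelated."""
    return [a & b for a, b in zip(xs, ys)]

def value(bits):
    """Decode: the signal value is the fraction of ones in the stream."""
    return sum(bits) / len(bits)

rng = random.Random(0)
n = 4096  # precision grows with stream length; latency/energy is the cost
x = to_bitstream(0.5, n, rng)
y = to_bitstream(0.375, n, rng)
print(value(sc_multiply(x, y)))  # approx 0.1875 = 0.5 * 0.375
```

    The example also hints at the latency disadvantage: representing a value to k-bit precision needs a stream of roughly 2^k bits, versus k bits in binary radix.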

    A 28-nm CMOS 1 V 3.5 GS/s 6-bit DAC with signal-independent delta-I noise DfT scheme

    This paper presents a 3.5 GS/s 6-bit current-steering DAC with auxiliary circuitry to assist testing, in a 1 V digital 28 nm CMOS process. The DAC uses only thin-oxide transistors and occupies 0.035 mm², making it suitable for embedding in VLSI systems, e.g., FPGAs. To cope with IC process variability, a unit-element approach is generally employed. The 3 MSBs are implemented as 7 unary D/A cells and the 3 LSBs as 3 binary D/A cells, using an appropriately reduced number of unit elements. Furthermore, all digital gates make use of only two basic unit blocks: a buffer and a multiplexer. For testing, a 5 kbit memory block is placed on-chip, which is loaded externally in a serial fashion but read internally in an 8x time-interleaved way. The memory is organized around 48 clocked 104-bit shift registers. It keeps the resulting switching disturbances signal-independent and hence avoids inducing output nonlinearity errors, even when a common power supply is shared with the DAC. This novelty allows reliable testing of the DAC core while avoiding the performance-limitation risks of handling high-speed off-chip data streams. The DAC SFDR > 40 dB bandwidth is 0.8 GHz, while the IM
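    A small Python sketch of the 3 + 3 segmentation described above: the 3 MSBs are thermometer-decoded into 7 unary cells and the 3 LSBs drive binary-weighted cells. This models only the code splitting, not the actual current-steering circuit; the function names are hypothetical.

```python
def segment_6bit(code):
    """Split a 6-bit DAC code into 7 thermometer (unary) MSB cells
    and 3 binary-weighted LSB cells (3 + 3 segmentation)."""
    assert 0 <= code < 64
    msb = code >> 3          # upper 3 bits select how many unary cells fire
    lsb = code & 0b111       # lower 3 bits drive binary-weighted cells
    thermometer = [1 if i < msb else 0 for i in range(7)]
    binary = [(lsb >> b) & 1 for b in range(3)]  # weights 1, 2, 4 units
    return thermometer, binary

def reconstruct(thermometer, binary):
    """Each unary cell is worth 8 LSB units; binary cells are 1/2/4 units."""
    return 8 * sum(thermometer) + sum(bit << b for b, bit in enumerate(binary))

t, b = segment_6bit(45)
assert reconstruct(t, b) == 45
```

    Note that 48 shift registers of 104 bits each gives 4992 bits, consistent with the roughly 5 kbit test memory the abstract describes.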

    A survey of network-based hardware accelerators

    Many practical data-processing algorithms fail to execute efficiently on general-purpose CPUs (Central Processing Units) due to the sequential nature of their operations and memory bandwidth limitations. To achieve the desired performance levels, reconfigurable hardware accelerators based on FPGAs (Field-Programmable Gate Arrays) are frequently explored, as they permit the processing units' architectures to be better adapted to the specific problem/algorithm requirements. In particular, network-based data-processing algorithms are very well suited to implementation in reconfigurable hardware because several data-independent operations can easily and naturally be executed in parallel over as many processing blocks as actually required and technically possible. GPUs (Graphics Processing Units) have also demonstrated good results in this area, but they tend to use significantly more power than FPGAs, which can be a limiting factor in embedded applications. Moreover, GPUs employ a Single Instruction, Multiple Threads (SIMT) execution model and are therefore optimized for SIMD (Single Instruction, Multiple Data) operations, while in FPGAs fully custom datapaths can be built, eliminating much of the control overhead. This review paper aims to analyze, compare, and discuss different approaches to implementing network-based hardware accelerators in FPGAs and programmable SoCs (Systems-on-Chip). The performed analysis and the derived recommendations should be useful to hardware designers of future network-based hardware accelerators.
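    A minimal Python model of the data-independence property the survey highlights, using an odd-even transposition sorting network: within each stage, every compare-exchange touches a disjoint pair of elements, so in an FPGA each stage can map to one comparator block per pair. The sequential software loop merely stands in for that parallel hardware; the network choice here is ours, for illustration.

```python
def odd_even_transposition_sort(data):
    """Sorting network: the compare-exchange schedule is fixed and
    data-independent, so all comparators in a stage can run in parallel."""
    a = list(data)
    n = len(a)
    for stage in range(n):
        start = stage % 2  # alternate odd/even pairings
        # all pairs in this stage are disjoint -> parallel in hardware
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_transposition_sort([7, 3, 9, 1, 4, 8]))  # [1, 3, 4, 7, 8, 9]
```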

    Analysis and comparison of different approaches to implementing a network-based parallel data processing algorithm

    It is well known that network-based parallel data processing algorithms are well suited to implementation in reconfigurable hardware, using either Field-Programmable Gate Arrays (FPGA) or Programmable Systems-on-Chip (PSoC). The intrinsic parallelism of these devices makes it possible to execute several data-independent network operations in parallel. However, the approaches to designing the respective systems vary significantly with the experience and background of the engineer in charge. In this paper, we analyze and compare the pros and cons of using an embedded processor, high-level synthesis methods, and low-level register-transfer design in terms of design effort, performance, and power consumption for implementing a parallel algorithm to find the two smallest values in a dataset. This problem is easy to formulate, has a number of practical applications (for instance, in low-density parity check decoders), and is very well suited to parallel implementation based on comparator networks.
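    To make the problem concrete, here is a software sketch of a tournament (comparator-tree) approach to finding the two smallest values: the minimum wins a binary tree of pairwise comparisons, and the runner-up must be among the values the winner beat directly. This mirrors the structure of a comparator network but is not the paper's actual FPGA/HLS implementation; it assumes at least two input values.

```python
def two_smallest(data):
    """Tournament method: the minimum is found with n-1 comparisons;
    the second smallest is the minimum of the winner's direct opponents
    (about log2(n) of them). Requires len(data) >= 2."""
    # each node carries (value, list of values it beat directly)
    layer = [(v, []) for v in data]
    while len(layer) > 1:
        nxt = []
        for i in range(0, len(layer) - 1, 2):
            (a, la), (b, lb) = layer[i], layer[i + 1]
            if a <= b:
                nxt.append((a, la + [b]))
            else:
                nxt.append((b, lb + [a]))
        if len(layer) % 2:  # odd element out gets a bye to the next round
            nxt.append(layer[-1])
        layer = nxt
    winner, beaten = layer[0]
    return winner, min(beaten)

print(two_smallest([9, 4, 7, 1, 8, 3]))  # (1, 3)
```

    In hardware, each round of this tree is a rank of comparators operating in parallel, which is what makes the problem a good benchmark for the design approaches the paper compares.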