2,325 research outputs found

    Asynchronous Circuit Stacking for Simplified Power Management

    Get PDF
    As digital integrated circuits (ICs) continue to increase in complexity, new challenges arise for designers. Complex ICs are often designed by incorporating multiple power domains therefore requiring multiple voltage converters to produce the corresponding supply voltages. These converters not only take substantial on-chip layout area and/or off-chip space, but also aggregate the power loss during the voltage conversions that must occur fast enough to maintain the necessary power supplies. This dissertation work presents an asynchronous Multi-Threshold NULL Convention Logic (MTNCL) “stacked” circuit architecture that alleviates this problem by reducing the number of voltage converters needed to supply the voltage the ICs operate at. By stacking multiple MTNCL circuits between power and ground, supplying a multiple of VDD to the entire stack and incorporating simple control mechanisms, the dynamic range fluctuation problem can be mitigated. A 130nm Bulk CMOS process and a 32nm Silicon-on-Insulator (SOI) CMOS process are used to evaluate the theoretical effect of stacking different circuitry while running different workloads. Post parasitic physical implementations are then carried out in the 32nm SOI process for demonstrating the feasibility and analyzing the advantages of the proposed MTNCL stacking architecture

    An Energy-Efficient Near-Data Processing Accelerator for DNNs that Optimizes Data Accesses

    Full text link
    The constant growth of DNNs makes them challenging to implement and run efficiently on traditional compute-centric architectures. Some accelerators have attempted to add more compute units and on-chip buffers to solve the memory wall problem without much success, and sometimes even worsening the issue since more compute units also require higher memory bandwidth. Prior works have proposed the design of memory-centric architectures based on the Near-Data Processing (NDP) paradigm. NDP seeks to break the memory wall by moving the computations closer to the memory hierarchy, reducing the data movements and their cost as much as possible. The 3D-stacked memory is especially appealing for DNN accelerators due to its high-density/low-energy storage and near-memory computation capabilities to perform the DNN operations massively in parallel. However, memory accesses remain as the main bottleneck for running modern DNNs efficiently. To improve the efficiency of DNN inference we present QeiHaN, a hardware accelerator that implements a 3D-stacked memory-centric weight storage scheme to take advantage of a logarithmic quantization of activations. In particular, since activations of FC and CONV layers of modern DNNs are commonly represented as powers of two with negative exponents, QeiHaN performs an implicit in-memory bit-shifting of the DNN weights to reduce memory activity. Only the meaningful bits of the weights required for the bit-shift operation are accessed. Overall, QeiHaN reduces memory accesses by 25\% compared to a standard memory organization. We evaluate QeiHaN on a popular set of DNNs. On average, QeiHaN provides 4.3x4.3x speedup and 3.5x3.5x energy savings over a Neurocube-like accelerator

    FPGA based Uniform Channelizer Implementation

    Get PDF
    Channelizers are widely used in modern digital communication systems. Advanced uniform multirate channelization have been theoretically proved to be capable of reducing the computational load, with a better performance. Therefore, in this thesis, we implement these designs on a FPGA board for the sake of the comprehensive evaluation of resource usage, performance and frequency response. The uniform filter-banks are one of the most essential unit in channelization. The Generalised Discrete Fourier Transform Modulated Filter Bank (GDFT-FB), as an important variant of basic a DFT-FB, has been implemented in FPGA and demonstrated with a better computational saving rather than traditional schemes. Moreover the oversampling version is demonstrated to have a better frequency response with an acceptable amount of extra resources. On the other hand, frequency response masking (FRM) techniques is able to reduce the number of coefficients. Therefore, the full FRM GDFT-FB and alternative narrowband FRM GDFT-FB are both implemented in FPGA platform, in order to achieve a better performance and hardware efficiency

    A Solder-Defined Computer Architecture for Backdoor and Malware Resistance

    Get PDF
    This research is about securing control of those devices we most depend on for integrity and confidentiality. An emerging concern is that complex integrated circuits may be subject to exploitable defects or backdoors, and measures for inspection and audit of these chips are neither supported nor scalable. One approach for providing a “supply chain firewall” may be to forgo such components, and instead to build central processing units (CPUs) and other complex logic from simple, generic parts. This work investigates the capability and speed ceiling when open-source hardware methodologies are fused with maker-scale assembly tools and visible-scale final inspection. The author has designed, and demonstrated in simulation, a 36-bit CPU and protected memory subsystem that use only synchronous static random access memory (SRAM) and trivial glue logic integrated circuits as components. The design presently lacks preemptive multitasking, ability to load firmware into the SRAMs used as logic elements, and input/output. Strategies are presented for adding these missing subsystems, again using only SRAM and trivial glue logic. A load-store architecture is employed with four clock cycles per instruction. Simulations indicate that a clock speed of at least 64 MHz is probable, corresponding to 16 million instructions per second (16 MIPS), despite the architecture containing no microprocessors, field programmable gate arrays, programmable logic devices, application specific integrated circuits, or other purchased complex logic. The lower speed, larger size, higher power consumption, and higher cost of an “SRAM minicomputer,” compared to traditional microcontrollers, may be offset by the fully open architecture—hardware and firmware—along with more rigorous user control, reliability, transparency, and auditability of the system. SRAM logic is also particularly well suited for building arithmetic logic units, and can implement complex operations such as population count, a hash function for associative arrays, or a pseudorandom number generator with good statistical properties in as few as eight clock cycles per 36-bit word processed. 36-bit unsigned multiplication can be implemented in software in 47 instructions or fewer (188 clock cycles). A general theory is developed for fast SRAM parallel multipliers should they be needed
    • …
    corecore