Asynchronous Circuit Stacking for Simplified Power Management
As digital integrated circuits (ICs) continue to increase in complexity, new challenges arise for designers. Complex ICs often incorporate multiple power domains and therefore require multiple voltage converters to produce the corresponding supply voltages. These converters not only take substantial on-chip layout area and/or off-chip space, but also accumulate power losses during the voltage conversions, which must occur fast enough to maintain the necessary power supplies. This dissertation presents an asynchronous Multi-Threshold NULL Convention Logic (MTNCL) “stacked” circuit architecture that alleviates this problem by reducing the number of voltage converters needed to supply the voltages at which the ICs operate. By stacking multiple MTNCL circuits between power and ground, supplying a multiple of VDD to the entire stack, and incorporating simple control mechanisms, the dynamic range fluctuation problem can be mitigated. A 130nm Bulk CMOS process and a 32nm Silicon-on-Insulator (SOI) CMOS process are used to evaluate the theoretical effect of stacking different circuitry while running different workloads. Post-parasitic physical implementations are then carried out in the 32nm SOI process to demonstrate the feasibility and analyze the advantages of the proposed MTNCL stacking architecture.
An Energy-Efficient Near-Data Processing Accelerator for DNNs that Optimizes Data Accesses
The constant growth of DNNs makes them challenging to implement and run
efficiently on traditional compute-centric architectures. Some accelerators
have attempted to add more compute units and on-chip buffers to solve the
memory wall problem without much success, sometimes even worsening the
issue, since more compute units also require higher memory bandwidth. Prior
works have proposed the design of memory-centric architectures based on the
Near-Data Processing (NDP) paradigm. NDP seeks to break the memory wall by
moving the computations closer to the memory hierarchy, reducing the data
movements and their cost as much as possible. The 3D-stacked memory is
especially appealing for DNN accelerators due to its high-density/low-energy
storage and near-memory computation capabilities to perform the DNN operations
massively in parallel. However, memory accesses remain the main bottleneck
for running modern DNNs efficiently.
To improve the efficiency of DNN inference we present QeiHaN, a hardware
accelerator that implements a 3D-stacked memory-centric weight storage scheme
to take advantage of a logarithmic quantization of activations. In particular,
since activations of FC and CONV layers of modern DNNs are commonly represented
as powers of two with negative exponents, QeiHaN performs an implicit in-memory
bit-shifting of the DNN weights to reduce memory activity. Only the meaningful
bits of the weights required for the bit-shift operation are accessed. Overall,
QeiHaN reduces memory accesses by 25% compared to a standard memory
organization. We evaluate QeiHaN on a popular set of DNNs. On average, QeiHaN
provides speedup and energy savings over a Neurocube-like
accelerator.
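The link between logarithmic activations and bit-shifting can be illustrated with a minimal sketch. It is not QeiHaN's implementation: the function names, the clamping range `min_exp`, and the integer weight format are assumptions made for illustration, and `surviving_bits` reflects only one reading of "meaningful bits" (the high-order weight bits that remain after the shift).

```python
import math
from typing import Optional


def log2_quantize(activation: float, min_exp: int = -7) -> Optional[int]:
    """Quantize an activation in (0, 1] to the nearest power of two.

    Returns the (non-positive) exponent e so that 2**e approximates the
    activation, or None for zero or negative inputs. min_exp is an assumed
    clamping bound, not a value taken from the abstract.
    """
    if activation <= 0.0:
        return None
    exponent = round(math.log2(activation))
    return max(min_exp, min(0, exponent))


def shift_multiply(weight: int, exponent: Optional[int]) -> int:
    """Multiply an integer weight by 2**exponent using a right shift."""
    if exponent is None:
        return 0  # a zero activation contributes nothing
    return weight >> (-exponent)


def surviving_bits(weight_bits: int, exponent: int) -> int:
    """Count the high-order weight bits that remain after the shift.

    Assumption: these are the only bits that would need to be read.
    """
    return max(0, weight_bits - (-exponent))
```

For example, an activation of 0.25 quantizes to exponent -2, so a weight of 96 is multiplied by shifting: `shift_multiply(96, log2_quantize(0.25))` gives 24. Under the stated assumption, only 6 of the 8 bits of an 8-bit weight would be needed for that product.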
FPGA based Uniform Channelizer Implementation
Channelizers are widely used in modern digital communication systems.
Advanced uniform multirate channelization schemes have been theoretically
shown to reduce the computational load while offering better performance.
Therefore, in this thesis, we implement these designs on an FPGA board for a
comprehensive evaluation of resource usage, performance and frequency
response.
Uniform filter banks are among the most essential units in channelization. The
Generalised Discrete Fourier Transform Modulated Filter Bank (GDFT-FB), an
important variant of the basic DFT-FB, has been implemented on an FPGA and
shown to achieve greater computational savings than traditional schemes.
Moreover, the oversampled version is shown to have a better frequency
response at the cost of an acceptable amount of extra resources. On the other
hand, frequency response masking (FRM) techniques are able to reduce the number
of coefficients. Therefore, the full FRM GDFT-FB and the alternative narrowband
FRM GDFT-FB are both implemented on an FPGA platform, in order to achieve better
performance and hardware efficiency.
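As background for the filter-bank terminology, the following is a minimal software sketch of the basic, critically sampled polyphase DFT filter bank that the GDFT-FB generalises. It is not the thesis's FPGA design and covers neither the GDFT, oversampled, nor FRM variants. The windowed-sinc prototype filter, the function names, and the choice of four taps per polyphase branch are all assumptions made for illustration.

```python
import cmath
import math


def prototype_lowpass(num_channels, taps_per_branch=4):
    """Hann-windowed sinc lowpass with cutoff pi/num_channels, unity DC gain."""
    length = num_channels * taps_per_branch
    centre = (length - 1) / 2.0
    taps = []
    for n in range(length):
        t = (n - centre) / num_channels
        sinc = 1.0 if t == 0 else math.sin(math.pi * t) / (math.pi * t)
        hann = 0.5 - 0.5 * math.cos(2.0 * math.pi * (n + 0.5) / length)
        taps.append(sinc * hann)
    gain = sum(taps)
    return [tap / gain for tap in taps]


def channelize(samples, num_channels, taps):
    """Split `samples` into num_channels bands, decimating by num_channels.

    For output index m, branch r accumulates
        u[r] = sum_p taps[r + p*K] * samples[m*K - r - p*K]
    and channel k is the length-K inverse DFT (without 1/K) of u.
    Samples before the start of the input are treated as zero.
    """
    K = num_channels
    branches = len(taps) // K
    frames = []
    for m in range(len(samples) // K):
        u = []
        for r in range(K):
            acc = 0j
            for p in range(branches):
                idx = m * K - r - p * K
                if idx >= 0:
                    acc += taps[r + p * K] * samples[idx]
            u.append(acc)
        frames.append([
            sum(u[r] * cmath.exp(2j * math.pi * k * r / K) for r in range(K))
            for k in range(K)
        ])
    return frames
```

Feeding a complex tone centred on channel 3 of an 8-channel bank, for instance `[cmath.exp(2j * math.pi * 3 * n / 8) for n in range(256)]`, should concentrate the output energy in channel 3 once the filter has filled. The computational saving comes from running the prototype filter once per output frame and sharing it across all channels through the DFT.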
A Solder-Defined Computer Architecture for Backdoor and Malware Resistance
This research is about securing control of those devices we most depend on for integrity and confidentiality. An emerging concern is that complex integrated circuits may be subject to exploitable defects or backdoors, and measures for inspection and audit of these chips are neither supported nor scalable. One approach for providing a “supply chain firewall” may be to forgo such components, and instead to build central processing units (CPUs) and other complex logic from simple, generic parts. This work investigates the capability and speed ceiling when open-source hardware methodologies are fused with maker-scale assembly tools and visible-scale final inspection. The author has designed, and demonstrated in simulation, a 36-bit CPU and protected memory subsystem that use only synchronous static random access memory (SRAM) and trivial glue logic integrated circuits as components. The design presently lacks preemptive multitasking, ability to load firmware into the SRAMs used as logic elements, and input/output. Strategies are presented for adding these missing subsystems, again using only SRAM and trivial glue logic. A load-store architecture is employed with four clock cycles per instruction. Simulations indicate that a clock speed of at least 64 MHz is probable, corresponding to 16 million instructions per second (16 MIPS), despite the architecture containing no microprocessors, field programmable gate arrays, programmable logic devices, application specific integrated circuits, or other purchased complex logic. The lower speed, larger size, higher power consumption, and higher cost of an “SRAM minicomputer,” compared to traditional microcontrollers, may be offset by the fully open architecture—hardware and firmware—along with more rigorous user control, reliability, transparency, and auditability of the system. 
SRAM logic is also particularly well suited for building arithmetic logic units, and can implement complex operations such as population count, a hash function for associative arrays, or a pseudorandom number generator with good statistical properties in as few as eight clock cycles per 36-bit word processed. 36-bit unsigned multiplication can be implemented in software in 47 instructions or fewer (188 clock cycles). A general theory is developed for fast SRAM parallel multipliers should they be needed.
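The idea of using memory as a logic element can be sketched in software by treating an SRAM as a lookup table addressed by a slice of the input word. The sketch below computes a 36-bit population count this way. The 6-bit slice width, the table name `POPCOUNT_ROM`, and the final summation in ordinary arithmetic are assumptions for illustration; the abstract does not describe how the author's design partitions the word or combines partial results.

```python
SLICE_BITS = 6   # assumed address width of each lookup table
WORD_BITS = 36   # word size stated in the abstract

# Model of one small memory: the entry at address a holds the number of
# set bits in a. With 6 address bits the table has 64 entries.
POPCOUNT_ROM = [bin(address).count("1") for address in range(1 << SLICE_BITS)]


def popcount36(word: int) -> int:
    """Count set bits in a 36-bit word using six table lookups."""
    if not 0 <= word < (1 << WORD_BITS):
        raise ValueError("word must fit in 36 bits")
    mask = (1 << SLICE_BITS) - 1
    total = 0
    for i in range(WORD_BITS // SLICE_BITS):
        address = (word >> (i * SLICE_BITS)) & mask  # select one 6-bit slice
        total += POPCOUNT_ROM[address]               # one "memory read"
    return total
```

In hardware the six lookups could proceed in parallel, one memory per slice, which is why table-driven logic can finish such operations in a small, fixed number of clock cycles.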