2,511 research outputs found
Implementation of a Combined OFDM-Demodulation and WCDMA-Equalization Module
For a dual-mode baseband receiver for the OFDMWireless LAN andWCDMA standards, integration of the demodulation and equalization tasks on a dedicated hardware module has been investigated. For OFDM demodulation, an FFT algorithm based on cascaded twiddle factor decomposition has been selected. This type of algorithm combines high spatial and temporal regularity in the FFT data-flow graphs with a minimal number of computations. A frequency-domain algorithm based on a circulant channel approximation has been selected for WCDMA equalization. It has good performance, low hardware complexity and a low number of computations. Its main advantage is the reuse of the FFT kernel, which contributes to the integration of both tasks. The demodulation and equalization module has been described at the register transfer level with the in-house developed Arx language. The core of the module is a pipelined radix-23 butterfly combined with a complex multiplier and complex divider. The module has an area of 0.447 mm2 in 0.18 Āæm technology and a power consumption of 10.6 mW. The proposed module compares favorably with solutions reported in literature
High Resolution Single-Chip Radix II FFT Processor for High- Tech Application
Electrical motors are vital components of many industrial processes and their operation failure leads losing in production line. Motor functionality and its behavior should be monitored to avoid production failure catastrophe. Hence, a highātech DSP processor is a significant method for electrical harmonic analysis that can be realized as embedded systems. This chapter introduces principal embedded design of novel highātech 1024āpoint FFT processor architecture for high performance harmonic measurement techniques. In FFT processor algorithm pipelining and parallel implementation are incorporated in order to enhance the performance. The proposed FFT makes use of floating point to realize higher precision FFT. Since floatingāpoint architecture limits the maximum clock frequency and increases the power consumption, the chapter focuses on improving the speed, area, resolution and power consumption, as well as latency for the FFT. It illustrates very largeāscale integration (VLSI) implementation of the floatingāpoint parallel pipelined (FPP) 1024āpoint Radix II FFT processor with applying novel architecture that makes use of only single butterfly incorporation of intelligent controller. The functionality of the conventional Radix II FFT was verified as embedded in FPGA prototyping. For area and power consumption, the proposed Radix II FPPāFFT was optimized in ASIC under Silterra 0.18 Āµm and Mimos 0.35 Āµm technology libraries
VLSI architecture design approaches for real-time video processing
This paper discusses the programmable and dedicated approaches for real-time video processing applications. Various VLSI architecture including the design examples of both approaches are reviewed. Finally, discussions of several practical designs in real-time video processing applications are then considered in VLSI architectures to provide significant guidelines to VLSI designers for any further real-time video processing design works
Fast, area-efficient 32-bit LNS for computer arithmetic operations
PhD ThesisThe logarithmic number system has been proposed as an alternative to floating-point.
Multiplication, division and square-root operations are accomplished with fixedpoint
arithmetic, but addition and subtraction are considerably more challenging.
Recent work has demonstrated that these operations too can be done with similar
speed and accuracy to their floating-point equivalents, but the necessary circuitry is
complex. In particular, it is dominated by the need for large lookup tables for the
storage of a non-linear function.
This thesis describes the architectures required to implement a newly design
approach for producing fast and area-efficient 32-bit LNS arithmetic unit. The
designs are structured based on two different algorithms. At first, a new cotransformation
procedure is introduced in the singularity region whilst performing
subtractions in which the technique capable to generate less total storage than the cotransformation
method in the previous LNS architecture. Secondly, improvement to
an existing interpolation process is proposed, that also reduce the total tables to an
extent that allows their easy synthesis in logic. Consequently, the total delays in the
system can be significantly reduced.
According to the comparison analysis with previous best LNS design and
floating-point units, it is shown that the new LNS architecture capable to offer
significantly better in speed while sustaining its accuracy within floating-point limit.
In addition, its implementation is more economical than previous best LNS system
and almost equivalent with existing floating-point arithmetic unit.University Malaysia Perlis:
Ministry of Higher Education, Malaysia
A Reconfigurable Tile-Based Architecture to Compute FFT and FIR Functions in the Context of Software-Defined Radio
Software-defined radio (SDR) is the term used for flexible radio systems that can deal with multiple standards. For an efficient implementation, such systems require appropriate reconfigurable architectures. This paper targets the efficient implementation of the most computationally intensive kernels of two significantly different standards, viz. Bluetooth and HiperLAN/2, on the same reconfigurable hardware. These kernels are FIR filtering and FFT. The designed architecture is based on a two-dimensional arrangement of 17 tiles. Each tile contains a multiplier, an adder, local memory and multiplexers allowing flexible communication with the neighboring tiles. The tile-base data path is complemented with a global controller and various memories. The design has been implemented in SystemC and simulated extensively to prove equivalence with a reference all-software design. It has also been synthesized and turns out to outperform significantly other reconfigurable designs with respect to speed and area
An Analog VLSI Deep Machine Learning Implementation
Machine learning systems provide automated data processing and see a wide range of applications. Direct processing of raw high-dimensional data such as images and video by machine learning systems is impractical both due to prohibitive power consumption and the ācurse of dimensionality,ā which makes learning tasks exponentially more difficult as dimension increases. Deep machine learning (DML) mimics the hierarchical presentation of information in the human brain to achieve robust automated feature extraction, reducing the dimension of such data. However, the computational complexity of DML systems limits large-scale implementations in standard digital computers. Custom analog signal processing (ASP) can yield much higher energy efficiency than digital signal processing (DSP), presenting means of overcoming these limitations.
The purpose of this work is to develop an analog implementation of DML system.
First, an analog memory is proposed as an essential component of the learning systems. It uses the charge trapped on the floating gate to store analog value in a non-volatile way. The memory is compatible with standard digital CMOS process and allows random-accessible bi-directional updates without the need for on-chip charge pump or high voltage switch.
Second, architecture and circuits are developed to realize an online k-means clustering algorithm in analog signal processing. It achieves automatic recognition of underlying data pattern and online extraction of data statistical parameters. This unsupervised learning system constitutes the computation node in the deep machine learning hierarchy.
Third, a 3-layer, 7-node analog deep machine learning engine is designed featuring online unsupervised trainability and non-volatile floating-gate analog storage. It utilizes massively parallel reconfigurable current-mode analog architecture to realize efficient computation. And algorithm-level feedback is leveraged to provide robustness to circuit imperfections in analog signal processing. At a processing speed of 8300 input vectors per second, it achieves 1Ć1012 operation per second per Watt of peak energy efficiency.
In addition, an ultra-low-power tunable bump circuit is presented to provide similarity measures in analog signal processing. It incorporates a novel wide-input-range tunable pseudo-differential transconductor. The circuit demonstrates tunability of bump center, width and height with a power consumption significantly lower than previous works
Analog Vector-Matrix Multiplier Based on Programmable Current Mirrors for Neural Network Integrated Circuits
We propose a CMOS Analog Vector-Matrix Multiplier for Deep Neural Networks, implemented in a standard single-poly 180 nm CMOS technology. The learning weights are stored in analog floating-gate memory cells embedded in current mirrors implementing the multiplication operations. We experimentally verify the analog storage capability of designed single-poly floating-gate cells, the accuracy of the multiplying function of proposed tunable current mirrors, and the effective number of bits of the analog operation. We perform system-level simulations to show that an analog deep neural network based on the proposed vector-matrix multiplier can achieve an inference accuracy comparable to digital solutions with an energy efficiency of 26.4 TOPs/J, a layer latency close to 100 mu s and an intrinsically high degree of parallelism. Our proposed design has also a cost advantage, considering that it can be implemented in a standard single-poly CMOS process flow
- ā¦