11 research outputs found
An Area and Energy Efficient Serial-Multiplier
In this work, we present an area- and energy-efficient serial multiplier. Specifically, we exploit symmetries in the odd and even partial products (PPs) of its radix-γ implementation. We then express them as ∓(2^k ± 1) with 1 ≤ k ≤ log₂γ − 1, which reduces the hardware resources. For γ ≥ 16, this representation becomes invalid, requiring additional power-of-two terms and raising hardware costs. To address this, we utilize recursive symmetries in the PPs, which enable time-sharing and reduce the logic resources for efficient realization. ASIC synthesis results show that the proposed design achieves substantial savings in area and energy over the state-of-the-art design.
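The role of the 2^k ± 1 form can be illustrated with a small sketch. It is a generic shift-and-add illustration, not the paper's circuit: the function names are hypothetical, and the search range for k is chosen for the demonstration and is not the bound quoted in the abstract.

```python
def odd_multiple(x, m):
    """Return m * x using one shift and one add/subtract.

    Works only when m has the form 2**k + 1 or 2**k - 1 (k >= 1);
    raises ValueError otherwise. Hypothetical helper for illustration.
    """
    for k in range(1, m.bit_length() + 1):
        if m == (1 << k) + 1:
            return (x << k) + x
        if m == (1 << k) - 1:
            return (x << k) - x
    raise ValueError(f"{m} is not of the form 2**k +/- 1")


def single_shift_odd_digits(gamma):
    """List the odd digits below gamma that need only one shift and one add."""
    digits = []
    for m in range(1, gamma, 2):
        try:
            odd_multiple(1, m)
            digits.append(m)
        except ValueError:
            pass  # e.g. 11 and 13 need extra power-of-two terms
    return digits
```

With this sketch, every odd digit below 8 is covered, whereas for γ = 16 the digits 11 and 13 fall outside the form, which matches the abstract's remark that larger radices need additional power-of-two terms.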
Low-complexity continuous-flow memory-based FFT architectures for real-valued signals
This paper presents two low-complexity continuous-flow memory-based fast Fourier transform (FFT) architectures (Type-I and Type-II) for real-valued signals. Both proposed designs employ split processing elements (SPEs); the Type-I SPE processes four inputs in parallel, while the Type-II SPE processes two. The SPE in the Type-I design contains a half-complex multiplier and that of the Type-II design contains a quarter-complex multiplier. Two new memory accessing schemes, one for each design, are proposed. The computational complexities of both architectures are analyzed and compared with existing designs. It is found that the Type-I architecture provides low complexity in terms of registers and multiplexers, while the Type-II architecture provides low complexity in terms of the multiplier. Application-specific integrated circuit (ASIC) synthesis and field-programmable gate array (FPGA) implementation results show that the proposed designs offer low area and low power and utilize fewer logic elements. For instance, the 32-point Type-I real FFT requires 5.06% less area, 15.1% less power, 6.58% fewer sliced look-up tables (SLUTs) and 5.25% fewer flip-flops (FFs), while Type-II requires 47.76% less area, 43.64% less power, 48.22% fewer SLUTs and 43.48% fewer FFs than the best existing scheme.
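As background for why real-valued FFT designs can save hardware, the sketch below shows the conjugate symmetry of the spectrum of a real signal: bins 0 to N/2 determine the rest. It uses a plain O(N²) DFT for clarity and assumes an even N; the function names are hypothetical and this is not the Type-I or Type-II architecture.

```python
import cmath


def dft(x):
    """Direct discrete Fourier transform (O(N^2)), used here only for clarity."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def real_dft_half(x):
    """Keep only bins 0..N/2 of the spectrum of a real-valued signal."""
    return dft(x)[: len(x) // 2 + 1]


def rebuild_full_spectrum(half, n):
    """Recover bins N/2+1..N-1 via X[N-k] = conj(X[k]) (n assumed even)."""
    full = list(half)
    for k in range(n // 2 + 1, n):
        full.append(half[n - k].conjugate())
    return full
```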
An Efficient Implementation Approach to FFT Processor for Spectral Analysis
This article presents an efficient hardware implementation approach to a variable-size fast Fourier transform (FFT) processor for spectral analysis. Due to its capability to handle different frame sizes, it can be adapted in situations where operating parameters necessitate adhering to different standard requirements. A serial real-valued processor with a new data-flow graph is considered, as it requires the least number of multipliers. By joint use of stage-specific optimization and multiplierless structure, the overall hardware efficiency of the proposed design is enhanced. Clock gating is employed to enable the variable-size processor operation along with power reduction. A fixed-point (FP) analysis of the proposed design is considered. The proposed novel multiplierless structure is based on shift and accumulation (SA). This also includes the generation (and sharing) of partial products (PPs) based on their symmetries. The proposed design offers low area and low power as compared with the state of the art. It is demonstrated for spectral analysis of electroencephalogram (EEG) signals for machine-learning-based epileptic seizure prediction on a field-programmable gate array (FPGA) platform.
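The general idea of shift and accumulation can be sketched as follows: a fixed-point product is formed by adding shifted copies of the input, one per set bit of the coefficient. This is the textbook bit-serial form under assumed parameters (non-negative integer coefficient, hypothetical names); it does not include the symmetry-based PP sharing the article describes.

```python
def shift_accumulate(x, coeff, frac_bits):
    """Multiply x by a non-negative fixed-point coefficient without a multiplier.

    coeff is an integer representing coeff / 2**frac_bits. Each set bit of
    coeff contributes one shifted copy of x to the accumulator.
    """
    acc = 0
    for bit in range(coeff.bit_length()):
        if (coeff >> bit) & 1:
            acc += x << bit  # partial product for this bit position
    return acc >> frac_bits  # drop the fractional bits (floor)
```

For example, a twiddle value near 0.7071 quantized to 8 fractional bits would be passed as `coeff=181, frac_bits=8`.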
High performance multiplierless serial pipelined VLSI architecture for real-valued FFT
This paper presents a high-performance multiplierless serial pipelined architecture for real-valued fast Fourier transform (FFT). A new data mapping scheme (DMS) is suggested for the proposed serial pipelined FFT architecture. The performance is enhanced by performing FFT computations in log₂N − 1 stages followed by a select-store-feedback (SSF) stage, where N is the number of points in the FFT. Further enhancement in performance is achieved by employing a quarter-complex multiplierless unit made up of memory and combinational logic in every stage. The memory stores half of the partial products, while the remaining partial products are handled by external combinational logic. Compared with the best existing scheme, the proposed design reduces the computational workload on half-butterfly (H-BF) units by (2N − 8). Application-specific integrated circuit (ASIC) and field-programmable gate array (FPGA) results show that the proposed 1024-point design achieves 31.54% less area, 30.13% less power, 33.56% less area-delay product (ADP), 27.11% fewer sliced look-up tables (SLUTs) and 28.37% fewer flip-flops (FFs) as compared to the best existing scheme.
Energy efficient VLSI architecture of real-valued serial pipelined FFT
This study presents an energy-efficient serial pipelined architecture of fast Fourier transform (FFT) to process real-valued signals. A new data mapping scheme is presented to obtain normal-order input and output without requiring a post-processing stage. It reduces the computational workload on the hardware resources, which is confirmed through mathematical derivations. Further, the proposed design involves a novel quadrant multiplier with relatively lower hardware complexity. It performs a quarter of the operation of a complex multiplier in one clock cycle, and thereby consumes relatively lower power. Moreover, in the last stage, a merged unit for butterfly computation and data re-ordering is also proposed, which either performs a half-butterfly operation or interchanges data, and thereby reduces the hardware usage. Application-specific integrated circuit synthesis and field-programmable gate array results show that for a 1024-point FFT computation, the proposed architecture offers 10.26% savings in area, 20.83% savings in power, 16.98% savings in area-delay product, 26.76% savings in energy-per-sample, 7.79% savings in sliced look-up tables, and 11.93% savings in flip-flops over the best existing design.
An area and power-efficient serial commutator FFT with recursive LUT multiplier
This paper presents an area- and power-efficient architecture for serial commutator real-valued fast Fourier transform (FFT) using a recursive look-up table (LUT). FFT computation consists of butterfly operations and twiddle factor multiplications. The area and power performance of FFT architectures is mainly limited by the multipliers. To address this, a new multiplier is proposed which stores the partial products in a LUT. Moreover, by adding a shifted version of the twiddle coefficients, the stored partial products gain symmetry, and thus the size of the LUT can be reduced to half. Further symmetry is achieved by adding another shifted version of the twiddle coefficients, and so on. This makes the proposed LUT multiplier recursive in nature. A new data management scheme is suggested for the proposed architecture. To validate the proposed architecture, application-specific integrated circuit (ASIC) synthesis and field-programmable gate array (FPGA) implementation are carried out for different symmetry factors. For instance, the proposed 1024-point architecture with a symmetry factor of two achieves 39.11% less area, 42.29% less power, 33.27% fewer sliced LUTs (SLUTs) and 29.18% fewer flip-flops (FFs) as compared to the best existing design.
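One well-known way in which an added offset makes stored partial products symmetric, so that only half of them need to be kept, is sketched below. It is an offset-binary style illustration under assumed names and integer coefficients; the paper's exact shifted-coefficient scheme and its recursion are not reproduced here.

```python
def build_half_lut(c, w):
    """Store only the odd multiples c*(2j+1), j = 0 .. 2**(w-1) - 1."""
    return [c * (2 * j + 1) for j in range(1 << (w - 1))]


def lut_multiply(c, d, w, half_lut):
    """Return c * d for a w-bit digit d using a half-size table.

    Uses 2*c*d = c*(2**w - 1) + c*(2*d - (2**w - 1)); the second term is
    antisymmetric about the middle of the digit range, so entries for d
    and (2**w - 1 - d) differ only in sign.
    """
    half = 1 << (w - 1)
    offset = c * ((1 << w) - 1)  # fixed offset term
    if d >= half:
        doubled = offset + half_lut[d - half]
    else:
        doubled = offset - half_lut[half - 1 - d]
    return doubled >> 1  # doubled is always even, so this is exact
```

A 3-bit digit would normally need eight stored products; here `build_half_lut(c, 3)` holds four, at the cost of one sign selection and one offset addition.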
A Novel Time-Shared and LUT-Less Pipelined Architecture for LMS Adaptive Filter
This article presents a novel time-shared and lookup table (LUT)-less pipelined architecture for a least-mean-square (LMS) adaptive filter (ADF). The proposed approach first employs a time-shared architecture for the pipelined LMS ADF to compute each filter partial product and coefficient increment term using a single multiplier. Critical path analysis of this architecture is carried out to determine the pipeline requirements. Next, a novel LUT-less multiplier is suggested by exploiting the symmetries between the odd-multiples. Due to the symmetries between the odd-multiples, offset terms are added using an adder tree. For higher wordlength coefficients, only a few adders are required to generate the odd-multiples, and only one offset adder tree is required. Finally, a novel super-latch is developed to pipeline the LUT-less multiplier with adaptation delays of the pipelined LMS ADF. From the implementation results, it is found that the proposed design for the 32nd-order filter occupies 60.32% less area, consumes 61.93% less power, and utilizes 58.83% fewer sliced LUTs and 63.28% fewer flip-flops than the best existing design.
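For readers unfamiliar with the two quantities the architecture time-shares, the filter partial products and the coefficient increment terms, a minimal software sketch of the textbook LMS update follows. It is the non-pipelined floating-point algorithm with hypothetical function names, not the delayed, time-shared hardware design described in the article.

```python
def fir_filter(h, x):
    """Reference FIR output: d[n] = sum_k h[k] * x[n - k] (zero initial state)."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


def lms_identify(x, d, num_taps, mu):
    """Adapt num_taps weights so the filter output tracks d; mu is the step size."""
    w = [0.0] * num_taps
    buf = [0.0] * num_taps  # most recent input first
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))    # filter partial products
        e = dn - y                                    # error signal
        w = [wi + mu * e * bi for wi, bi in zip(w, buf)]  # coefficient increments
    return w
```

Each iteration needs one multiplication per tap for the output and one per tap for the update, which is the workload a time-shared design maps onto a single multiplier.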
Low-Area and Low-Power VLSI Architectures for Long Short-Term Memory Networks
Long short-term memory (LSTM) networks are extensively used in various sequential learning tasks, including speech recognition. Their significance in real-world applications has prompted the demand for cost-effective and power-efficient designs. This paper introduces LSTM architectures based on distributed arithmetic (DA), utilizing circulant and block-circulant matrix-vector multiplications (MVMs) for network compression. A quantized-weights-oriented approach for training circulant and block-circulant matrices is considered. By formulating fixed-point circulant/block-circulant MVMs, we explore the impact of kernel size on accuracy. Our DA-based approach employs shared full and partial methods of add-store/store-add followed by a select unit to realize an MVM. It is then coupled with a multi-partial strategy to reduce complexity for larger kernel sizes. Further complexity reduction is achieved by optimizing the decoders of multiple select units. Pipelining in add-store enhances speed at the expense of a few pipeline registers. Field-programmable gate array results showcase the superiority of our proposed architectures based on the partial store-add method, delivering reductions of 98.71% in DSP slices, 33.59% in slice look-up tables, 13.43% in flip-flops, and 29.76% in power compared to the state of the art.
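The compression offered by circulant matrices can be seen in a short sketch: an n × n circulant matrix is fully defined by its first column, so only n weights are stored instead of n². The code below is a plain integer illustration with hypothetical names, using the first-column convention; it does not model the DA-based add-store or store-add hardware.

```python
def circulant_matvec(c, x):
    """Compute y = C x where C is the circulant matrix with first column c."""
    n = len(c)
    return [sum(c[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]


def circulant_dense(c):
    """Expand the first column c into the full n x n matrix, for comparison only."""
    n = len(c)
    return [[c[(i - j) % n] for j in range(n)] for i in range(n)]
```

A block-circulant matrix applies the same idea per block, so the kernel (block) size sets the trade-off between stored weights and accuracy that the abstract mentions.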