
    Low power two-channel PR QMF bank using CSD coefficients and FPGA implementation

    The finite impulse response (FIR) filter is a fundamental component in digital signal processing, and two-channel perfect reconstruction (PR) QMF banks are widely used in applications such as image coding, speech processing and communications. A practical lattice realization of a two-channel QMF bank is developed in this thesis to deal with the wide dynamic range of intermediate results in the lattice structure. To achieve low complexity and low power consumption, the canonical signed digit (CSD) number system is used to represent the lattice coefficients in the FPGA implementation; using CSD in the lattice structure leads to more efficient hardware. Fixed-point simulations were carried out in Matlab to determine suitable fixed-point word lengths for the different signals. FPGA implementation results show that a perfect reconstruction signal is obtained with the proposed method, and that the power consumption with CSD-coded lattice coefficients is lower than with two's complement coefficients. A low complexity, low power two-channel PR QMF bank using CSD coefficients was thereby realized.
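    As a sketch of the CSD recoding that underlies this approach, the Python below (our own illustrative model, not the thesis implementation) converts an integer coefficient into canonical signed digit form, in which every multiplication by that coefficient reduces to a short chain of shifts, additions and subtractions.

```python
def to_csd(n: int) -> list[int]:
    """Convert an integer to canonical signed digit (CSD) form.

    Returns digits from {-1, 0, +1}, least significant first,
    with no two adjacent non-zero digits (the non-adjacent form).
    """
    digits = []
    while n != 0:
        if n % 2:                # odd: emit +1 or -1 so the next n is even
            d = 2 - (n % 4)      # n = 1 (mod 4) -> +1, n = 3 (mod 4) -> -1
        else:
            d = 0
        digits.append(d)
        n = (n - d) // 2
    return digits

# 93 = 0b1011101 has five non-zero bits in plain binary, but only four
# non-zero CSD digits: 93 = 128 - 32 - 4 + 1.
assert sum(d << i for i, d in enumerate(to_csd(93))) == 93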

    LOW POWER MULTIPLIER USING ALGORITHMIC NOISE TOLERANT ARCHITECTURE

    A multiplier is one of the key hardware blocks in most digital signal processing (DSP) systems. Typical DSP applications in which a multiplier plays an important role include digital filtering, digital communications and spectral analysis (Ayman A. et al., 2001). Many current DSP applications target portable, battery-operated systems, so power dissipation becomes one of the primary design constraints. Since multipliers are rather complex circuits and must typically operate at a high system clock rate, reducing the delay of a multiplier is an essential part of meeting the overall design targets. In this project a multiplier block is designed using the algorithmic noise tolerant (ANT) architecture with a Wallace multiplier: a reliable low power fixed-width multiplier in which reduced precision replica redundancy (RPR) provides error resilience and the main block is built from a Wallace multiplier. The new architecture achieves high accuracy, low power consumption and area efficiency compared with previous multiplier circuits.
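    The ANT correction step can be illustrated with a small behavioral model. The Python sketch below is our own illustration under assumed operand widths, not the project's RTL: the main Wallace multiplier may produce soft errors under voltage overscaling, while a reduced-precision replica is always correct but coarse; the replica's output is used whenever the two disagree by more than the replica's worst-case truncation error.

```python
def rpr_product(a: int, b: int, width: int, keep: int) -> int:
    """Reduced-precision replica: multiply only the top `keep` bits of each
    `width`-bit unsigned operand, then shift the result back into place."""
    drop = width - keep
    return ((a >> drop) * (b >> drop)) << (2 * drop)

def ant_multiply(main_result: int, a: int, b: int,
                 width: int = 16, keep: int = 8) -> int:
    """Algorithmic noise tolerance: accept the (possibly erroneous) main
    multiplier output if it stays within the replica's truncation bound,
    otherwise fall back to the replica's estimate."""
    replica = rpr_product(a, b, width, keep)
    drop = width - keep
    # Upper bound on the replica's truncation error for unsigned operands.
    bound = ((1 << drop) - 1) * ((1 << width) - 1) * 2
    return main_result if abs(main_result - replica) <= bound else replica

# Example: a timing error flips a high-order bit of the main product.
a, b = 40000, 51234
faulty = (a * b) ^ (1 << 30)           # hypothetical soft error
print(ant_multiply(faulty, a, b))      # falls back to the coarse replica
```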

    IMPLEMENTATION OF HIGH-SPEED MULTIPLIER FILTERS USING A MODIFIED NON RECURSIVE COMMON DADDA MULTIPLIER

    A multiplier is one of the key hardware blocks in most digital signal processing (DSP) systems. Typical DSP applications in which a multiplier plays an important role include digital filtering, digital communications and spectral analysis (Ayman A. et al., 2001). Many current DSP applications target portable, battery-operated systems, so power dissipation becomes one of the primary design constraints. Since multipliers are rather complex circuits and must typically operate at a high system clock rate, reducing the delay of a multiplier is an essential part of meeting the overall design targets. In this project two multipliers are designed, an array multiplier and a modified Dadda multiplier, each combined with a truncated multiplier. The comparison is carried out with the Xilinx ISE 12.3i EDA tool by developing the register transfer level (RTL) description in Verilog HDL.
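    As an illustration of the truncation idea mentioned above, the following Python behavioral model (our own sketch, not the project's Verilog) forms only the partial products that land in the upper output columns of a fixed-width multiplier and compares the result with the exact product.

```python
def truncated_multiply(a: int, b: int, width: int = 8) -> int:
    """Fixed-width truncated multiplication of two `width`-bit unsigned
    operands: keep only the partial-product bits whose column index is at
    least `width`, i.e. those that feed the upper half of the product."""
    acc = 0
    for i in range(width):          # bit i of a
        for j in range(width):      # bit j of b
            if i + j >= width and (a >> i) & 1 and (b >> j) & 1:
                acc += 1 << (i + j)
    return acc

a, b = 0xB7, 0x5C
exact = a * b
approx = truncated_multiply(a, b)
print(f"exact={exact}, truncated={approx}, error={exact - approx}")
# The dropped columns remove roughly half of the partial-product hardware
# at the cost of a bounded truncation error in the low-order bits.
```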

    Energy-efficient Hardware Accelerator Design for Convolutional Neural Network

    Department of Electrical Engineering
    A convolutional neural network (CNN) is a class of deep neural network that shows superior performance in handling images. Because of this performance, CNNs are widely used to classify objects or detect the position of an object in an image. A CNN can be implemented on either edge devices or cloud servers. Since cloud servers have high computational capability, a CNN on the cloud can perform a large number of tasks at once with high throughput. However, CNN on the cloud incurs a long round-trip time: to run inference on an image, data from a sensor must be uploaded to the cloud server, and the processed information from the CNN must then be transferred back to the user. If an application requires a rapid response, this long round-trip time is a critical issue, and the problem becomes more serious when a high-resolution input image is required. An edge device, on the other hand, has very short latency even though it has limited computing resources, and since it does not transmit images over the network, its performance is not affected by network bandwidth. Because of these features, the cloud is efficient for CNN computing in most cases, but the edge device is preferred in some applications; for example, CNN algorithms for autonomous cars require rapid responses, and edge devices are also suitable for CNN applications involving privacy or security. However, because the edge device has a limited energy budget, the energy efficiency of the CNN accelerator is a very important issue. An embedded CNN accelerator consists of off-chip memory, a host CPU and a hardware accelerator; the hardware accelerator comprises a main controller, a global buffer and arrays of processing elements (PEs), together with separate compression and activation modules.
    In this dissertation, we propose energy-efficient designs in three different parts. First, we propose a time-multiplexed PE to increase the energy efficiency of the multipliers. Exploiting the fact that feature maps are dominated by small values, defined as non-outliers, we increase the energy efficiency of computing non-outliers; to further improve the energy efficiency of the PE, approximate computing is introduced, together with a method to optimize the trade-off between accuracy and energy. Second, we investigate an energy-efficient accuracy recovery circuit. For CNN implementations on the edge, the CNN loops are usually tiled, and tiling can degrade accuracy; we analyze the accuracy reduction due to tiling and recover accuracy by extending the partial sums with very small energy overhead. Third, we reduce the energy consumed by DRAM accesses. A CNN requires massive data transfers between on-chip and off-chip memory, and the energy spent on these transfers accounts for a large portion of the total energy consumption; we propose a spatial-correlation-aware compression algorithm to reduce the transmission of feature maps. At each of these three levels, this dissertation proposes novel optimizations and design flows that increase the energy efficiency of a CNN accelerator on the edge.
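    The loop tiling targeted by the second contribution can be sketched as follows; this Python model is purely our illustration with made-up tile sizes, not the dissertation's hardware. It accumulates partial sums tile by tile over the input channels, which is exactly the point at which reduced-precision partial sums would lose accuracy.

```python
import numpy as np

def tiled_conv2d(x, w, tile_c=16):
    """Direct 2-D convolution (valid padding, stride 1) with the input-channel
    loop tiled: partial sums are accumulated per tile of `tile_c` channels,
    mimicking how an edge accelerator with a small global buffer iterates."""
    C, H, W = x.shape                 # input channels, height, width
    K, _, R, S = w.shape              # output channels, _, kernel height/width
    out = np.zeros((K, H - R + 1, W - S + 1), dtype=np.float32)
    for c0 in range(0, C, tile_c):    # one tile of input channels at a time
        xt, wt = x[c0:c0 + tile_c], w[:, c0:c0 + tile_c]
        for k in range(K):
            for i in range(out.shape[1]):
                for j in range(out.shape[2]):
                    # partial sum for this tile, accumulated into `out`
                    out[k, i, j] += np.sum(xt[:, i:i + R, j:j + S] * wt[k])
    return out

x = np.random.rand(32, 8, 8).astype(np.float32)
w = np.random.rand(4, 32, 3, 3).astype(np.float32)
ref = tiled_conv2d(x, w, tile_c=32)          # single tile == untiled result
assert np.allclose(tiled_conv2d(x, w, tile_c=16), ref, atol=1e-4)
```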

    Linear-Phase FIR Digital Filter Design with Reduced Hardware Complexity using Discrete Differential Evolution

    Optimal design of fixed-coefficient, finite word length, linear-phase FIR digital filters for custom ICs has been the focus of research in the past decade. With ever increasing demands for high throughput and low power circuits, the need to design filters with reduced hardware complexity has become more crucial. Multiplierless filters provide substantial savings in hardware by using a shift-add network to generate the filter coefficients. In this thesis, the multiplierless filter design problem is modeled as a combinatorial optimization problem and solved using a discrete Differential Evolution algorithm. The Differential Evolution algorithm's population representation is adapted to the finite word length filter design problem and the mutation operator is redefined for discrete-valued parameters. Experiments show that the method can design filters of up to 300 taps with reduced hardware and shorter design times.
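    A minimal sketch of the discrete Differential Evolution step described above is given below in Python. It is our own illustration under assumed details (integer-quantized half-coefficients of a symmetric filter, a simple passband/stopband frequency-response cost, and a standard DE/rand/1 mutation rounded back onto the integer grid), not the thesis algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def cost(h_int, wordlength=10, pass_edge=0.4, stop_edge=0.6, n_fft=512):
    """Frequency-response error of a symmetric (linear-phase) FIR filter whose
    half-coefficients are the integers h_int scaled by 2**-(wordlength-1):
    deviation from unity gain in the passband plus stopband energy."""
    h = np.concatenate([h_int, h_int[::-1]]) / 2.0 ** (wordlength - 1)
    H = np.abs(np.fft.rfft(h, n_fft))
    w = np.linspace(0, 1, len(H))
    return (np.sum((H[w <= pass_edge] - 1.0) ** 2)
            + np.sum(H[w >= stop_edge] ** 2))

def discrete_de(dim=16, pop=30, iters=200, F=0.7, CR=0.9, cmax=511):
    """Differential Evolution with mutants rounded to integer coefficients."""
    P = rng.integers(-cmax, cmax + 1, size=(pop, dim)).astype(float)
    fit = np.array([cost(x) for x in P])
    for _ in range(iters):
        for i in range(pop):
            a, b, c = P[rng.choice(pop, 3, replace=False)]
            mutant = np.rint(a + F * (b - c))            # discrete mutation
            mask = rng.random(dim) < CR                  # binomial crossover
            trial = np.clip(np.where(mask, mutant, P[i]), -cmax, cmax)
            f = cost(trial)
            if f < fit[i]:                               # greedy selection
                P[i], fit[i] = trial, f
    return P[np.argmin(fit)], fit.min()

best, best_cost = discrete_de()
print("best integer coefficients:", best.astype(int), "cost:", best_cost)
```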

    Digital Filters

    Advances in technology mean that a great number of system signals can now be measured easily and at low cost. The main problem is that usually only a fraction of each signal is useful for a given purpose, whether that is maintenance, DVD recorders, computers, electric/electronic circuits, econometrics or optimization. Digital filters are among the most versatile, practical and effective methods for extracting the required information from a signal. They can be dynamic, adjusting automatically or manually to external and internal conditions. Presented in this book are the most advanced digital filters, including different case studies and the most relevant literature.

    Design and Implementation of Complexity Reduced Digital Signal Processors for Low Power Biomedical Applications

    Wearable health monitoring systems can provide remote care with supervised, independent living, and are capable of signal sensing, acquisition, local processing and transmission. A generic biopotential signal processing platform (for signals such as the electrocardiogram (ECG) and electroencephalogram (EEG)) consists of four main functional components. The signals acquired by the electrodes are amplified and preconditioned by the (1) analog front-end (AFE) and then digitized by the (2) analog-to-digital converter (ADC) for further processing. The local digital signal processing is usually handled by a custom-designed (3) digital signal processor (DSP), which is responsible for any one or a combination of signal processing tasks such as noise detection, noise/artefact removal, feature extraction, classification and compression. The digitally processed data is then sent by the (4) transmitter, which is renowned as the most power-hungry block in the complete platform. All of these components must be designed and fitted into an integrated system with stringent area and power requirements, so the hardware complexity and power dissipation of each functional component are crucial aspects of designing and implementing a wearable monitoring platform. The work undertaken focuses on reducing the hardware complexity of a biosignal DSP and presents low hardware complexity solutions that can be employed in such wearable platforms.
    A typical state-of-the-art system uses Sigma Delta (Σ∆) ADCs incorporating a Σ∆ modulator and a decimation filter, where the state-of-the-art decimation filters employ high-order linear-phase finite impulse response (FIR) filters that increase the hardware complexity [1–5]. In this thesis, the novel use of minimum-phase infinite impulse response (IIR) decimators is proposed, which massively reduces the hardware complexity compared to conventional FIR decimators. The non-linear phase effects of these filters are also investigated, since phase non-linearity may distort the time-domain representation of the filtered signal, an undesirable effect for biopotential signals, especially when the fiducial characteristics carry diagnostic importance. For ECG monitoring systems the effect of the IIR filter phase non-linearity is shown to be minimal and does not affect the diagnostic accuracy of the signals.
    The work also proposes two methods for reducing the hardware complexity of a popular biosignal processing tool, the discrete wavelet transform (DWT). General-purpose multipliers are hardware- and power-hungry in terms of the number of addition operations, or of their underlying building blocks such as full and half adders. A higher number of adders leads to increased power consumption, which is directly proportional to the clock frequency, supply voltage, switching activity and the resources utilized; a typical field-programmable gate array's (FPGA) resources are look-up tables (LUTs), whereas a custom DSP uses gate-level cells from standard cell libraries to build adders [6]. The first proposed method replaces the hardware- and power-hungry general-purpose multipliers and the coefficient memories with reconfigurable multiplier blocks composed of simple shift-add networks and multiplexers. This method substantially reduces the resource utilization as well as the power consumption of the system. The second proposed method is the design and implementation of the DWT filter banks using IIR filters, which require fewer arithmetic operations than the state-of-the-art FIR wavelets. This reduces the hardware complexity of the DWT analysis filter bank and can be employed in applications where reconstruction is not required. However, the synthesis filter bank for the IIR wavelet transform has a higher computational complexity than conventional FIR wavelet synthesis filter banks, since re-indexing of the filtered data sequence is required, which can only be achieved with extra registers. This led to a novel design that replaces the complex IIR-based synthesis filter banks with FIR filters that approximate the associated IIR filters. Finally, a comparative study is presented in which the hybrid IIR/FIR and FIR/FIR wavelet filter banks are deployed in a typical noise reduction scenario using wavelet thresholding techniques. It is concluded that the proposed hybrid IIR/FIR wavelet filter banks provide better denoising performance, reduced computational complexity and lower power consumption than their IIR/IIR and FIR/FIR counterparts.
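    To make the FIR-versus-IIR complexity argument above concrete, the short Python/SciPy comparison below (our illustration with assumed filter specifications, not the thesis design) estimates the minimum Kaiser-window FIR length and the minimum elliptic IIR order that meet the same decimation-filter specification.

```python
from scipy import signal

# Assumed decimation-filter spec (frequencies normalized to Nyquist = 1):
wp, ws = 0.1, 0.15        # passband / stopband edges
gpass, gstop = 0.1, 60.0  # max passband ripple / min stopband attenuation (dB)

# Minimum-order elliptic IIR filter meeting the spec.
iir_order, _ = signal.ellipord(wp, ws, gpass, gstop)

# Kaiser-window FIR length for the same transition width and attenuation.
fir_taps, _ = signal.kaiserord(gstop, ws - wp)

print(f"elliptic IIR order: {iir_order}")   # a handful of biquad sections
print(f"Kaiser FIR length:  {fir_taps}")    # typically tens of times more taps
```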

    Cross-Layer Automated Hardware Design for Accuracy-Configurable Approximate Computing

    Approximate Computing trades computation accuracy against performance or energy efficiency. It is a design paradigm that arose in the last decade as an answer to diminishing returns from Dennard scaling and a shift in the prominent workloads. A range of modern workloads, categorized mainly as recognition, mining, and synthesis, features an inherent tolerance to approximations: characteristics such as redundancy in their input data and robust-to-noise algorithms allow them to produce outputs of acceptable quality despite approximations in some of their computations. Approximate Computing leverages this application tolerance by relaxing exactness in computation towards the primary design goals of increasing performance or improving energy efficiency. Existing techniques span the abstraction layers of computer systems, and cross-layer techniques have been shown to offer a larger design space and yield higher savings. Currently, the majority of existing work aims at meeting a single accuracy level; the extent of approximation tolerance, however, varies significantly with changes in input characteristics and applications. In this dissertation, methods and implementations are presented for the cross-layer, automated design of accuracy-configurable Approximate Computing that maximally exploits the performance and energy benefits. In particular, this dissertation addresses the following challenges and introduces novel contributions.
    A main Approximate Computing category in hardware is to scale either voltage or frequency beyond the safe limits, for power or performance benefits, respectively. The rationale is that timing errors appear gradually and are tolerable over an initial range, so this scaling enables fine-grained accuracy configurability by varying the rate of timing errors. However, conventional synthesis tools aim at meeting a single delay for all paths within the circuit; consequently, under voltage or frequency scaling either all paths succeed, or a large number of paths fail simultaneously with a steep increase in error rate and magnitude. This dissertation presents an automated method that minimizes path delays by individually constraining the primary outputs of combinational circuits. As a result, it reduces the number of failing paths and makes the timing errors significantly more gradual, as well as rarer and smaller on average. It also reveals that delays can be significantly reduced towards the least significant bit (LSB), which allows operation at a higher frequency when small operands are computed.
    Precision scaling, i.e., reducing the representation of data and its accuracy, is widely used across abstraction layers in Approximate Computing. Reducing data precision also reduces transistor toggling, and therefore dynamic power consumption. Application- and architecture-level precision scaling results in using only the LSBs of a circuit, and arithmetic circuits often have less complexity and logic depth in the LSBs than in the most significant bits (MSBs). To take advantage of this circuit property, a delay-altering synthesis methodology is proposed: it finds energy-optimal delay values under configurable precision usage and assigns them to the primary outputs used for the different precisions, thereby enabling dynamic frequency-precision scalable circuits for energy efficiency.
    Within the hardware architecture, it is possible to instantiate multiple units with the same functionality but different fixed approximation levels, where each block benefits from having fewer transistors and from synthesis relaxations. These blocks can be selected dynamically and thus allow the accuracy to be configured at runtime. Instantiating such approximate blocks can be a lower-dynamic-power but higher-area-and-leakage alternative to the current state-of-the-art gating mechanisms, which switch off a group of paths in the circuit to reduce toggling activity. Jointly, multiple instantiated blocks and gating mechanisms produce a large design space of accuracy-configurable hardware in which energy-optimal solutions require a cross-layer search at the architecture and circuit levels. To that end, an approximate hardware synthesis methodology is proposed with joint optimizations at the architecture and circuit levels for dynamic accuracy scaling, thereby enabling energy vs. area trade-offs.
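    The claim that per-output delay constraints make timing errors gradual can be illustrated with a toy behavioral model. The Python sketch below is purely our illustration under an assumed linear delay model for a ripple-carry adder, not the dissertation's synthesis flow: output bits whose worst-case path delay exceeds the clock period keep their value from the previous cycle, so shortening the clock corrupts MSBs first while the LSBs stay exact.

```python
def overscaled_add(a: int, b: int, prev_sum: int, clock_ns: float,
                   bits: int = 16, t_fa_ns: float = 0.5) -> int:
    """Behavioral model of a frequency-overscaled ripple-carry adder.

    Output bit i is assumed to need (i + 1) * t_fa_ns to settle; bits that
    do not settle within `clock_ns` retain their value from `prev_sum`.
    """
    exact = (a + b) & ((1 << bits) - 1)
    settled = min(bits, int(clock_ns // t_fa_ns))   # bits 0..settled-1 are valid
    mask = (1 << settled) - 1
    return (exact & mask) | (prev_sum & ~mask & ((1 << bits) - 1))

a, b, prev = 0x3A7C, 0x1234, 0x0000
for clk in (8.0, 4.0, 2.0):                         # shrinking clock period
    approx = overscaled_add(a, b, prev, clk)
    print(f"clock={clk:>4} ns  sum=0x{approx:04X}  "
          f"error={abs(approx - ((a + b) & 0xFFFF))}")
```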

    The automated compilation of comprehensive hardware design search spaces of algorithmic-based implementations for FPGA design exploration

    Over the past few years FPGA hardware has become a logical choice for implementing cutting-edge signal processing applications. While there have been advances in FPGA technology, the common process of creating specialized hardware implementations for them remains a manual one involving extensive design exploration. Design exploration requires a designer to look for designs that fit a set of performance characteristics, such as size, throughput, or power, depending on the application, and it can be the most time consuming step when creating FPGA hardware. This is a nontrivial task that requires extensive background knowledge of both FPGA hardware and the application being implemented. While advances have been made in automating the design process, there is still a gap between application writers and hardware engineers that can be filled.
    This thesis presents a novel approach for automating the generation of hardware design search spaces that contain a comprehensive set of ways to implement signal processing algorithms on FPGAs. To accomplish this, we generate a set of equivalent mathematical representations for an input equation via a novel declarative programming language that avoids a number of difficulties associated with the imperative languages used by previous approaches. We show that this equation space is bounded in terms of bracketing and ordering of mathematical operations, and that by changing the way an equation is written we can generate unique hardware instantiations (designs). The generated instantiations are mapped to heterogeneous computing architectures and written in a structural hardware description language style to ensure that the intended instantiation behaves as predicted in hardware.
    A software system was created based on this approach that generates an equation space for varying numbers of summed multiplications and converts each representation into a comprehensive hardware design search space that can be analyzed for performance characteristics such as size, throughput, latency, and power.
    Ph.D., Electrical Engineering -- Drexel University, 200
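    The equation-space idea of enumerating every bracketing of a summed-multiplication expression can be sketched as follows. This Python is our own illustration of the combinatorial bound (the number of binary bracketings grows as the Catalan numbers), not the thesis's declarative language or tool flow, and it enumerates bracketings only, not reorderings.

```python
from functools import lru_cache

def bracketings(terms):
    """Enumerate every way to parenthesize a sum of terms as a binary tree;
    each distinct tree corresponds to a distinct adder-tree instantiation."""
    if len(terms) == 1:
        return [terms[0]]
    results = []
    for split in range(1, len(terms)):
        for left in bracketings(terms[:split]):
            for right in bracketings(terms[split:]):
                results.append(f"({left} + {right})")
    return results

@lru_cache(maxsize=None)
def catalan(n: int) -> int:
    """Number of binary bracketings of n+1 terms (Catalan number C_n)."""
    return 1 if n == 0 else sum(catalan(i) * catalan(n - 1 - i) for i in range(n))

products = ["a0*x0", "a1*x1", "a2*x2", "a3*x3"]
trees = bracketings(products)
print(len(trees), "bracketings of a 4-term sum of products")   # 5 == C_3
assert len(trees) == catalan(len(products) - 1)
for t in trees:
    print(t)
```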