3 research outputs found

    Design-Space Exploration of Mixed-precision DNN Accelerators based on Sum-Together Multipliers

    Mixed-precision quantization (MPQ) is gaining momentum in academia and industry as a way to improve the trade-off between accuracy and latency of Deep Neural Networks (DNNs) in edge applications. MPQ requires dedicated hardware that supports different bit-widths. One approach uses Precision-Scalable MAC units (PSMACs) based on multipliers operating in Sum-Together (ST) mode, which can be configured to compute N = 1, 2, 4 multiplications/dot-products in parallel on operands of 16/N bits. We contribute to the State of the Art (SoA) in three directions: we compare the SoA ST multiplier architectures in performance, power and area for the first time; we extend the portfolio of ST-based accelerators by proposing three designs for the most common DNN operations: 2D convolution, depth-wise convolution and fully-connected layers; and we show how these accelerators can be obtained with a High-Level Synthesis (HLS) flow. In particular, we perform a design-space exploration (DSE) in area, latency and power, varying many knobs, including PSMAC unit parallelism, clock frequency and ST multiplier type. From the DSE on a 28-nm technology we observe that, both at multiplier level and at accelerator level, there is no one-size-fits-all solution for every scenario. Our findings allow accelerator designers to choose, from a rich variety, the best combination of ST multiplier and HLS knobs for their target: high performance, low area or low power.
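
    As an illustration, the following is a minimal behavioral sketch in Python of a PSMAC built on an ST multiplier, as described above. It is a functional model only, not RTL: the function names, the signed-operand handling and the accumulator interface are our assumptions, since the abstract does not fix an interface.

        def sign_extend(value: int, bits: int) -> int:
            """Interpret the low `bits` bits of `value` as a two's-complement integer."""
            value &= (1 << bits) - 1
            sign_bit = 1 << (bits - 1)
            return (value ^ sign_bit) - sign_bit

        def st_mac(a_ops, b_ops, n, acc=0):
            """Behavioral model of a PSMAC based on a Sum-Together multiplier.

            In ST mode the 16-bit datapath computes an N-element dot product
            (N in {1, 2, 4}) on operands of 16/N bits, then adds it to the
            running accumulator.
            """
            assert n in (1, 2, 4), "ST mode supports N = 1, 2, 4"
            assert len(a_ops) == len(b_ops) == n
            width = 16 // n
            dot = sum(sign_extend(a, width) * sign_extend(b, width)
                      for a, b in zip(a_ops, b_ops))
            return acc + dot

        # N = 4: four 4-bit multiplications computed in parallel and summed.
        print(st_mac([1, -2, 3, 7], [5, 4, -6, 2], n=4))   # 5 - 8 - 18 + 14 = -7

    In actual ST hardware the N sub-products share a single reconfigured 16-bit multiplier array rather than N separate multipliers; the model above captures only the functional behavior that the DSE knobs select among.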

    A high-performance inner-product processor for real and complex numbers.

    A novel, high-performance fixed-point inner-product processor based on a redundant binary number system is investigated in this dissertation. This scheme halves the number of partial products while achieving better speed and area performance and providing pipeline extension opportunities. When modified Booth coding is used, partial products are reduced by almost 75%, significantly reducing the multiplier addition depth. The design is applicable to digital signal and image processing applications that require real and/or complex inner-product arithmetic, such as digital filters, correlation and convolution. It is well suited for VLSI implementation and can also be embedded as an inner-product core inside a general-purpose or DSP FPGA-based processor. Dynamic control of the computing structure permits different computations: a variety of real and complex inner-product computations, parallel multiplication of real and complex numbers, and real and complex division. The same structure can also be controlled to accept redundant binary inputs for multiplication and inner-product computations. An improved 2's-complement to redundant binary converter is also presented.
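
    To make the partial-product claim concrete, the sketch below shows radix-4 modified Booth recoding, the coding named in the abstract: an n-bit two's-complement multiplier is recoded into roughly n/2 signed digits, each generating one partial product, which is where the reduction in multiplier addition depth comes from. This is a generic Python illustration with assumed names; the dissertation's redundant binary adder structure is not modeled.

        def booth_radix4_digits(y: int, bits: int):
            """Recode an n-bit two's-complement multiplier into ~n/2 radix-4
            Booth digits in {-2, -1, 0, +1, +2}, one partial product each."""
            y &= (1 << bits) - 1
            y <<= 1                              # append the implicit bit y_{-1} = 0
            digits = []
            for i in range(0, bits, 2):
                b0 = (y >> i) & 1                # y_{i-1}
                b1 = (y >> (i + 1)) & 1          # y_i
                b2 = (y >> (i + 2)) & 1          # y_{i+1}
                digits.append(b0 + b1 - 2 * b2)  # signed digit value
            return digits

        def booth_multiply(x: int, y: int, bits: int = 8) -> int:
            """Multiply by summing one shifted (and possibly negated)
            partial product per Booth digit: x*y = sum_k d_k * x * 4**k."""
            return sum(d * (x << (2 * k))
                       for k, d in enumerate(booth_radix4_digits(y, bits)))

        assert booth_multiply(19, -3) == -57     # 4 partial products instead of 8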

    Energy, area and speed optimized signal processing on FPGA

    Matrix multiplication and the Fast Fourier Transform (FFT) are two computationally intensive DSP functions widely used as kernel operations in applications such as graphics, imaging and wireless communication. Traditionally, the performance metrics for signal processing have been latency and throughput; energy efficiency has become increasingly important with the proliferation of portable mobile devices, as in software-defined radio. An FPGA-based system is a viable solution where adaptability and high computational power are required, but FPGA resources are limited, so energy, area and latency must be optimized jointly. Since an algorithm can be mapped to an FPGA in numerous ways, determining these parameters by low-level simulation of every candidate design is prohibitively time-consuming; a high-level energy model, whose parameters can be determined at the algorithm and architecture level, is therefore needed. In this dissertation, matrix multiplication algorithms are implemented with pipelining and parallel processing to increase throughput and reduce latency, and thereby the energy dissipation; however, this increases area through the larger number of processing elements. Most of the design area is occupied by the multipliers, and it grows further with the input word width, which complicates VLSI implementation. A word-width decomposition technique is therefore combined with these algorithms to keep the size of the multipliers fixed irrespective of the input data width. The FFT algorithms are implemented with pipelining to increase throughput. To reduce the energy and area of the complex multipliers used for multiplication by the twiddle factors, distributed arithmetic is used to provide a multiplier-less architecture, and parallel distributed-arithmetic models are used to compensate for the speed penalty. This dissertation also proposes a method for optimizing the parameters of these two kernels at high level, by constructing a high-level energy model from the specified algorithms and architectures; results obtained from the model are compared with those from low-level simulation to estimate the error.
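
    Two of the techniques mentioned above lend themselves to short sketches. First, word-width decomposition: a wide multiplication is built entirely from fixed-size multipliers by splitting each operand into fixed-width chunks, so the multiplier size stays constant regardless of the input word width. The Python below is a hypothetical unsigned illustration; the dissertation's exact decomposition may differ.

        def decompose(x: int, chunk_bits: int, n_chunks: int):
            """Split an unsigned operand into fixed-width chunks, LSB first."""
            mask = (1 << chunk_bits) - 1
            return [(x >> (chunk_bits * i)) & mask for i in range(n_chunks)]

        def decomposed_multiply(a: int, b: int, chunk_bits: int = 4, width: int = 16) -> int:
            """Wide unsigned multiplication using only fixed-size
            (chunk_bits x chunk_bits) multipliers."""
            n = width // chunk_bits
            product = 0
            for i, ai in enumerate(decompose(a, chunk_bits, n)):
                for j, bj in enumerate(decompose(b, chunk_bits, n)):
                    product += (ai * bj) << (chunk_bits * (i + j))
            return product

        assert decomposed_multiply(12345, 678) == 12345 * 678

    Second, distributed arithmetic: for a dot product with fixed coefficients (such as twiddle factors), every possible sum of the coefficients is precomputed in a look-up table, and the inputs are scanned bit-serially with shift-and-add accumulation, removing the multipliers entirely. Again a generic sketch with assumed names, shown here for unsigned inputs.

        def da_dot_product(coeffs, xs, bits=8):
            """Multiplier-less dot product via distributed arithmetic.

            A 2**K-entry LUT stores every possible sum of the K fixed
            coefficients; the unsigned inputs are scanned bit-serially and
            the LUT outputs are shift-accumulated."""
            k = len(coeffs)
            lut = [sum(c for j, c in enumerate(coeffs) if (idx >> j) & 1)
                   for idx in range(1 << k)]
            acc = 0
            for b in range(bits):
                idx = sum(((x >> b) & 1) << j for j, x in enumerate(xs))
                acc += lut[idx] << b
            return acc

        assert da_dot_product([3, -5, 7, 2], [10, 20, 30, 40]) == 3*10 - 5*20 + 7*30 + 2*40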