1,066 research outputs found
A Reconfigurable Digital Multiplier and 4:2 Compressor Cells Design
With the continually growing use of portable computing devices and increasingly complex software applications, there is a constant push for low-power, high-speed circuitry to support this technology. Because arithmetic operations in applications such as digital signal processing are heavily used and require large, complex circuitry, there has been a great focus on increasing the efficiency of computer arithmetic circuitry. A key player in the realm of computer arithmetic is the digital multiplier, and because of its size and power consumption it has moved to the forefront of today's research. This thesis introduces a reconfigurable digital multiplier architecture. Regulated by a 2-bit control signal, the multiplier is capable of double- and single-precision multiplication, as well as fault-tolerant and dual-throughput single-precision execution. The proposed architecture is centered on a recursive multiplication algorithm, in which a large multiplication is carried out using recursions of simpler sub-multiplier modules. Within each sub-multiplier module, 4:2 compressor rows are used for partial product reduction instead of carry-save adder arrays; they are more efficient and thus lower the delay and power consumption of the whole multiplier. In addition, a study of various digital logic circuit styles is presented first, followed by three different designs of a 4:2 compressor in domino logic; simulation results confirm the properties of the proposed design in terms of delay, power consumption and operation frequency
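The two ideas named in this abstract, a 4:2 compressor and recursive multiplication from smaller sub-multipliers, can be sketched in a few lines of Python. This is a behavioural illustration only, not the thesis's circuit: the function names, the two-full-adder compressor structure, and the `base_width` parameter are assumptions made for the example.

```python
def full_adder(a, b, c):
    """Add three bits; return (sum bit, carry bit)."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def compressor_4_2(x1, x2, x3, x4, cin):
    """4:2 compressor modelled as two chained full adders (one common structure).

    Satisfies x1 + x2 + x3 + x4 + cin == total + 2 * (carry + cout).
    """
    s1, cout = full_adder(x1, x2, x3)
    total, carry = full_adder(s1, x4, cin)
    return total, carry, cout


def recursive_multiply(a, b, width, base_width=4):
    """Multiply two unsigned `width`-bit values using four half-width products.

    `base_width` (hypothetical) is the size at which a sub-multiplier is used
    directly instead of recursing further.
    """
    if width <= base_width:
        return a * b
    half = width // 2
    mask = (1 << half) - 1
    a_hi, a_lo = a >> half, a & mask
    b_hi, b_lo = b >> half, b & mask
    hh = recursive_multiply(a_hi, b_hi, half, base_width)
    hl = recursive_multiply(a_hi, b_lo, half, base_width)
    lh = recursive_multiply(a_lo, b_hi, half, base_width)
    ll = recursive_multiply(a_lo, b_lo, half, base_width)
    # Recombine the four sub-products at their bit weights.
    return (hh << (2 * half)) + ((hl + lh) << half) + ll
```

In hardware the four sub-products would be summed by compressor rows rather than by the `+` operator used here; the sketch only shows the decomposition and the compressor's arithmetic identity.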
Arithmetic core generation using bit heaps
A bit heap is a data structure that holds the unevaluated sum of an arbitrary number of bits, each weighted by some power of two. Most advanced arithmetic cores can be viewed as involving one or several bit heaps. We claim here that this point of view leads to better global optimization at the algebraic level, at the circuit level, and in terms of software engineering. To demonstrate it, a generic software framework is introduced for the definition and optimization of bit heaps. This framework, targeting DSP-enabled FPGAs, is developed within the open-source FloPoCo arithmetic core generator. Its versatility is demonstrated on several examples: multipliers, complex multipliers, polynomials, and discrete cosine transform
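To make the bit-heap definition concrete, here is a minimal Python sketch of a heap of weighted bits that is compressed with full adders until every column holds at most two bits. It is not FloPoCo's implementation or API; the class name `BitHeap` and all of its methods are hypothetical, and the compression strategy is a plain greedy pass chosen for brevity.

```python
from collections import defaultdict


class BitHeap:
    """Unevaluated sum of bits, each stored in the column of its weight."""

    def __init__(self):
        self.columns = defaultdict(list)  # weight -> list of bits

    def add_bit(self, weight, bit):
        self.columns[weight].append(bit & 1)

    def add_product(self, a, b, width):
        """Throw all partial-product bits of an unsigned a * b onto the heap."""
        for i in range(width):
            for j in range(width):
                self.add_bit(i + j, ((a >> i) & 1) & ((b >> j) & 1))

    def value(self):
        return sum(bit << w for w, bits in self.columns.items() for bit in bits)

    def height(self):
        return max((len(bits) for bits in self.columns.values()), default=0)

    def compress_once(self):
        """Replace each group of three bits in a column by a sum and a carry bit."""
        new_columns = defaultdict(list)
        for w, bits in self.columns.items():
            i = 0
            while len(bits) - i >= 3:
                a, b, c = bits[i:i + 3]
                new_columns[w].append(a ^ b ^ c)
                new_columns[w + 1].append((a & b) | (b & c) | (a & c))
                i += 3
            new_columns[w].extend(bits[i:])  # leftovers pass through unchanged
        self.columns = new_columns

    def compress(self):
        """Compress until two rows remain, ready for a final carry-propagate add."""
        while self.height() > 2:
            self.compress_once()
```

Because the heap is just a bag of weighted bits, several operators can share one heap: calling `add_product` twice before `compress` sums two products with a single compression tree, which is the kind of global optimization the abstract argues for.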
A Survey on Approximate Multiplier Designs for Energy Efficiency: From Algorithms to Circuits
Given the stringent requirements of energy efficiency for Internet-of-Things
edge devices, approximate multipliers, as a basic component of many processors
and accelerators, have been constantly proposed and studied for decades,
especially in error-resilient applications. The computation error and energy
efficiency largely depend on how and where the approximation is introduced into
a design. Thus, this article aims to provide a comprehensive review of the
approximation techniques in multiplier designs ranging from algorithms and
architectures to circuits. We have implemented representative approximate
multiplier designs in each category to understand the impact of the design
techniques on accuracy and efficiency. The designs can then be effectively
deployed in high-level applications, such as machine learning, to gain energy
efficiency at the cost of slight accuracy loss. Comment: 38 pages, 37 figures
Energy-efficient acceleration of MPEG-4 compression tools
We propose novel hardware accelerator architectures for the most computationally demanding algorithms of the MPEG-4 video compression standard: motion estimation, binary motion estimation (for shape coding), and the forward/inverse discrete cosine transforms (incorporating shape adaptive modes). These accelerators have been designed using general low-energy design philosophies at the algorithmic/architectural abstraction levels. The themes of these philosophies are avoiding waste and trading area/performance for power and energy gains. Each core has been synthesised targeting TSMC 0.09 μm TCBN90LP technology, and the experimental results presented in this paper show that the proposed cores improve upon the prior art
Ultra-Fast, High-Performance 8x8 Approximate Multipliers by a New Multicolumn 3,3:2 Inexact Compressor and its Derivatives
The multiplier, a key component in many different applications, is a
time-consuming, energy-intensive computation block. Approximate computing is a
practical design paradigm that attempts to improve hardware efficacy while
keeping computation quality satisfactory. A novel multicolumn 3,3:2 inexact
compressor is presented in this paper. It takes three partial products from two
adjacent columns each for rapid partial product reduction. The proposed inexact
compressor and its derivatives enable us to design a high-speed approximate
multiplier. Then, another ultra-fast, high-efficiency approximate multiplier is
achieved utilizing a systematic truncation strategy. The proposed multipliers
accumulate partial products in only two stages, one fewer stage than other
approximate multipliers in the literature. Implementation results from Synopsys
Design Compiler at the 45 nm technology node demonstrate nearly 11.11% higher
speed for the second proposed design over the fastest existing approximate
multiplier. Furthermore, the new approximate multipliers are applied to the
image processing application of image sharpening, and their performance in this
application is highly satisfactory. It is shown in this paper that the error
pattern of an approximate multiplier, in addition to the mean error distance
and error rate, has a direct effect on the outcomes of the image processing
application. Comment: 21 pages, 18 figures, 6 tables
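The two error metrics mentioned above, mean error distance and error rate, can be computed exhaustively for a small approximate multiplier. The sketch below is illustrative only: it uses a simple column-truncation multiplier as a stand-in, not the paper's 3,3:2 inexact compressor, and the names `truncated_multiply`, `error_metrics`, and `dropped_columns` are assumptions.

```python
def truncated_multiply(a, b, width=8, dropped_columns=4):
    """Approximate unsigned product that ignores the lowest partial-product columns.

    A stand-in approximate multiplier: every partial-product bit whose weight
    is below `dropped_columns` is discarded.
    """
    result = 0
    for i in range(width):
        for j in range(width):
            if i + j >= dropped_columns:
                result += (((a >> i) & 1) & ((b >> j) & 1)) << (i + j)
    return result


def error_metrics(approx, width=8):
    """Return (mean error distance, error rate) over all unsigned input pairs.

    Error distance is |exact - approximate|; error rate is the fraction of
    input pairs for which the approximate result is wrong.
    """
    n = 1 << width
    total_distance = 0
    wrong = 0
    for a in range(n):
        for b in range(n):
            distance = abs(a * b - approx(a, b))
            total_distance += distance
            wrong += distance != 0
    pairs = n * n
    return total_distance / pairs, wrong / pairs
```

Calling `error_metrics(truncated_multiply)` evaluates all 65,536 input pairs of an 8x8 multiplier. As the abstract notes, these two scalars do not capture the error pattern, so two designs with similar values may still behave differently in an image processing application.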
Partial Product Reduction based on Look-Up Tables
In this paper a new technique for partial product reduction based on the use of look-up tables for efficient processing is presented. We describe how to construct counter devices with pre-calculated data and how to integrate them into the whole operation. The development of reduction tree organizations for this kind of device exploits the inherent integration benefits of computer memories and offers an alternative implementation to classic operation methods. In our experiments we therefore compare our implementation model with a CMOS technology model in homogeneous terms
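One way to picture a counter built from pre-calculated data is a table indexed by the input bit pattern that returns the number of ones. The Python sketch below applies such a table to the columns of a partial-product array. It is an assumption-laden illustration, not the paper's device or tree organization; all function names are hypothetical.

```python
def build_counter_lut(inputs):
    """Precompute the number of ones in every `inputs`-bit pattern."""
    return [bin(pattern).count("1") for pattern in range(1 << inputs)]


def partial_product_columns(a, b, width):
    """Arrange the partial-product bits of unsigned a * b by column weight."""
    columns = [[] for _ in range(2 * width - 1)]
    for i in range(width):
        for j in range(width):
            columns[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    return columns


def column_value(columns):
    """Evaluate the weighted sum represented by the columns."""
    return sum(bit << w for w, bits in enumerate(columns) for bit in bits)


def reduce_columns(columns, lut, inputs):
    """One reduction pass: each full group of `inputs` bits becomes its count.

    The count read from the table is written back as binary digits into the
    current column and the columns above it. Incomplete groups pass through.
    """
    out_bits = inputs.bit_length()  # bits needed to encode a count of `inputs`
    reduced = [[] for _ in range(len(columns) + out_bits)]
    for weight, bits in enumerate(columns):
        full = len(bits) - len(bits) % inputs
        for start in range(0, full, inputs):
            index = 0
            for k, bit in enumerate(bits[start:start + inputs]):
                index |= bit << k
            count = lut[index]  # table look-up replaces adder logic
            for k in range(out_bits):
                reduced[weight + k].append((count >> k) & 1)
        reduced[weight].extend(bits[full:])
    return reduced
```

With `inputs=3` the table behaves like a 3:2 counter (a full adder); larger tables such as `inputs=7` model a 7:3 counter at the cost of a 128-entry memory.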
Low energy HEVC and VVC video compression hardware
Video compression standards compress a digital video by reducing and removing redundancy in it using computationally complex algorithms. As the spatial and temporal resolutions of videos increase, the compression efficiencies of video compression algorithms are also increasing. However, increased compression efficiency comes with increased computational complexity. Therefore, it is necessary to reduce the computational complexity of video compression algorithms without reducing their visual quality in order to reduce the area and energy consumption of their hardware implementations. In this thesis, we propose a novel technique for reducing the amount of computation performed by the HEVC intra prediction algorithm. We designed low-energy, reconfigurable HEVC intra prediction hardware using the proposed technique. We also designed a low-energy FPGA implementation of the HEVC intra prediction algorithm using the proposed technique and DSP blocks. We propose a reconfigurable VVC intra prediction hardware architecture. We also propose an efficient VVC intra prediction hardware architecture using DSP blocks. We designed low-energy VVC fractional interpolation hardware. We propose a novel approximate absolute difference technique. We designed low-energy approximate absolute difference hardware using the proposed technique. We propose a novel approximate constant multiplication technique. We designed approximate constant multiplication hardware using the proposed technique. We quantified the computation reductions achieved by the proposed techniques and the video quality loss caused by the proposed approximation techniques. The proposed approximate absolute difference technique and approximate constant multiplication technique cause very small PSNR loss. The other proposed techniques cause no PSNR loss. We implemented the proposed hardware architectures in Verilog HDL.
We mapped the Verilog RTL code to Xilinx Virtex 6 or Xilinx Virtex 7 FPGAs and estimated their power consumption using the Xilinx XPower Analyzer tool. The proposed techniques significantly reduced the power and energy consumption of these FPGA implementations
Low-Power, Low-Cost, & High-Performance Digital Designs : Multi-bit Signed Multiplier design using 32nm CMOS Technology
Binary multipliers are ubiquitous in digital hardware. Digital multipliers, along with adders, play a major role in computing, communication, and control devices. Multipliers are used mainly in digital signal and image processing, the central processing units (CPUs) of computers, high-performance and parallel scientific computing, machine learning, physical layer design of communication equipment, etc. The predominant presence of, and increasing demand for, low-power, low-cost, and high-performance digital hardware led to this work on developing optimized multiplier designs. Two optimized designs are proposed in this work. One is an optimized 8 x 8 Booth multiplier architecture implemented using 32nm CMOS technology. Synthesis (pre-layout) and post-layout results show that the delay is reduced by 24.7% and 25.6% respectively, the area is reduced by 5.5% and 15% respectively, the power consumption is reduced by 21.5% and 26.6% respectively, and the area-delay-product is reduced by 28.8% and 36.8% respectively when compared to the performance results obtained for the state-of-the-art 8 x 8 Booth multiplier designed using 32nm CMOS technology with 1.05 V supply voltage at 500 MHz input frequency. The other is a novel radix-8 structure with 3-bit grouping to reduce the number of partial products, along with effective partial product reduction schemes for 8 x 8, 16 x 16, 32 x 32, and 64 x 64 signed multipliers. Comparing the performance results of the (synthesized, post-layout) designs of sizes 32 x 32 and 64 x 64 based on the simple novel radix-8 structure with the estimated performance measurements for the optimized Booth multiplier design presented in this work, a reduction in delay by (2.64%, 0.47%) and (2.74%, 18.04%) respectively, and a reduction in area-delay-product by (12.12%, -5.17%) and (17.82%, 12.91%) respectively can be observed. With the use of the higher radix structure, delay, area, and power consumption can be further reduced.
Appropriate adder deployment, further exploration of optimized grouping or compression strategies, and additional low-power design techniques such as power gating, multi-Vt MOS transistors, and multiple VDD domains can, together with higher-radix structures, help realize more efficient multiplier designs
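The effect of 3-bit grouping on the number of partial products can be shown with a small Python sketch. It handles unsigned operands only and uses plain radix-8 digits, so it is not the thesis's signed radix-8 structure or a Booth recoder; the function names are hypothetical.

```python
def radix8_digits(multiplier, width):
    """Split an unsigned `width`-bit multiplier into 3-bit digits, LSB first."""
    return [(multiplier >> shift) & 0b111 for shift in range(0, width, 3)]


def radix8_multiply(multiplicand, multiplier, width):
    """Return (product, number of partial products) using radix-8 digits.

    Each digit d in 0..7 selects the multiple d * multiplicand, shifted by
    three bit positions per digit, so about width / 3 partial products are
    summed instead of `width` single-bit rows.
    """
    partial_products = [
        (digit * multiplicand) << (3 * index)
        for index, digit in enumerate(radix8_digits(multiplier, width))
    ]
    return sum(partial_products), len(partial_products)
```

For an 8-bit multiplier this yields 3 partial products instead of 8, and for 64 bits it yields 22 instead of 64. The price is that multiples such as 3x, 5x, and 7x of the multiplicand are not simple shifts and must be generated with extra additions.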
Spatial phase dislocations in femtosecond laser pulses
We show that spatial phase dislocations associated with optical vortices can be embedded in femtosecond laser beams by computer-generated holograms, provided that they are built in a setup compensating for the introduced spatial dispersion of the broad spectrum. We present analytical results describing two possible arrangements: a dispersionless 4f setup and a double-pass grating compressor. Experimental results on the generation of optical vortices in the output beam of a 20 fs Ti:sapphire laser and the proof-of-principle measurements with a broadband-tunable cw Ti:sapphire laser confirm our theoretical predictions. This research was partially supported by the National Science Fund (Bulgaria), under contract F-1303/2003, and the Australian Research Council