1,055 research outputs found

    High throughput spatial convolution filters on FPGAs

    Get PDF
    Digital signal processing (DSP) on field- programmable gate arrays (FPGAs) has long been appealing because of the inherent parallelism in these computations that can be easily exploited to accelerate such algorithms. FPGAs have evolved significantly to further enhance the mapping of these algorithms, included additional hard blocks, such as the DSP blocks found in modern FPGAs. Although these DSP blocks can offer more efficient mapping of DSP computations, they are primarily designed for 1-D filter structures. We present a study on spatial convolutional filter implementations on FPGAs, optimizing around the structure of the DSP blocks to offer high throughput while maintaining the coefficient flexibility that other published architectures usually sacrifice. We show that it is possible to implement large filters for large 4K resolution image frames at frame rates of 30–60 FPS, while maintaining functional flexibility

    Assessing Approximate Arithmetic Designs in the presence of Process Variations and Voltage Scaling

    Get PDF
    As environmental concerns and portability of electronic devices move to the forefront of priorities, innovative approaches which reduce processor energy consumption are sought. Approximate arithmetic units are one of the avenues whereby significant energy savings can be achieved. Approximation of fundamental arithmetic units is achieved by judiciously reducing the number of transistors in the circuit. A satisfactory tradeoff of energy vs. accuracy of the circuit can be determined by trial-and-error methods of each functional approximation. Although the accuracy of the output is compromised, it is only decreased to an acceptable extent that can still fulfill processing requirements. A number of scenarios are evaluated with approximate arithmetic units to thoroughly cross-check them with their accurate counterparts. Some of the attributes evaluated are energy consumption, delay and process variation. Additionally, novel methods to create such approximate units are developed. One such method developed uses a Genetic Algorithm (GA), which mimics the biologically-inspired evolutionary techniques to obtain an optimal solution. A GA employs genetic operators such as crossover and mutation to mix and match several different types of approximate adders to find the best possible combination of such units for a given input set. As the GA usually consumes a significant amount of time as the size of the input set increases, we tackled this problem by using various methods to parallelize the fitness computation process of the GA, which is the most compute intensive task. The parallelization improved the computation time from 2,250 seconds to 1,370 seconds for up to 8 threads, using both OpenMP and Intel TBB. Apart from using the GA with seeded multiple approximate units, other seeds such as basic logic gates with limited logic space were used to develop completely new multi-bit approximate adders with good fitness levels. iii The effect of process variation was also calculated. As the number of transistors is reduced, the distribution of the transistor widths and gate oxide may shift away from a Gaussian Curve. This result was demonstrated in different types of single-bit adders with the delay sigma increasing from 6psec to 12psec, and when the voltage is scaled to Near-Threshold-Voltage (NTV) levels sigma increases by up to 5psec. Approximate Arithmetic Units were not affected greatly by the change in distribution of the thickness of the gate oxide. Even when considering the 3-sigma value, the delay of an approximate adder remains below that of a precise adder with additional transistors. Additionally, it is demonstrated that the GA obtains innovative solutions to the appropriate combination of approximate arithmetic units, to achieve a good balance between energy savings and accuracy

    Timing Measurement Platform for Arbitrary Black-Box Circuits Based on Transition Probability

    No full text

    High-Performance Accurate and Approximate Multipliers for FPGA-Based Hardware Accelerators

    Get PDF
    Multiplication is one of the widely used arithmetic operations in a variety of applications, such as image/video processing and machine learning. FPGA vendors provide high-performance multipliers in the form of DSP blocks. These multipliers are not only limited in number and have fixed locations on FPGAs but can also create additional routing delays and may prove inefficient for smaller bit-width multiplications. Therefore, FPGA vendors additionally provide optimized soft IP cores for multiplication. However, in this work, we advocate that these soft multiplier IP cores for FPGAs still need better designs to provide high-performance and resource efficiency. Toward this, we present generic area-optimized, low-latency accurate, and approximate softcore multiplier architectures, which exploit the underlying architectural features of FPGAs, i.e., lookup table (LUT) structures and fast-carry chains to reduce the overall critical path delay (CPD) and resource utilization of multipliers. Compared to Xilinx multiplier LogiCORE IP, our proposed unsigned and signed accurate architecture provides up to 25% and 53% reduction in LUT utilization, respectively, for different sizes of multipliers. Moreover, with our unsigned approximate multiplier architectures, a reduction of up to 51% in the CPD can be achieved with an insignificant loss in output accuracy when compared with the LogiCORE IP. For illustration, we have deployed the proposed multiplier architecture in accelerators used in image and video applications, and evaluated them for area and performance gains. Our library of accurate and approximate multipliers is opensource and available online at https://cfaed.tu-dresden.de/pd-downloads to fuel further research and development in this area, facilitate reproducible research, and thereby enabling a new research direction for the FPGA community

    Design and Analysis of Multiplexer based Approximate Adder for Low Power Applications

    Get PDF
    Low power consumption is crucial for error-acceptable multimedia devices, with picture compression approaches leveraging various digital processing architectures and algorithms. Humans can assemble useful information from partially inaccurate outputs in many multimedia applications. As a result, producing exact outputs is not required. The demand for an exact outcome is fading because new innovative systems are forgiving of faults. In the domain where error-tolerance is accepted, approximate computing is a new paradigm that relaxes the requirement for an accurate modeling while offering power, time, and delay benefits. Adders are an essential arithmetic module for regulating power and memory usage in digital systems. The recent implementation and use of approximate adders have been supported by trade-off characteristics such as delay, lower power consumption. This study examines the delay and power consumption of conventional and approximate adders. Also, a simple, fast, and power-efficient multiplexer-based approximate adder is proposed, and its performance outperforms the adders compared with existing adders. The proposed adder can be utilized in error-tolerant and various digital signal processing applications where exact results are not required. The proposed and existing adders are designed using EDA software for the performance calculations. With a delay of 81 pS, the proposed adder circuit reduces power consumption compared to the exact one. The experiment shows that the designed approximate adder can be used to implement circuits for image processing systems because it has a smaller delay and uses less energy
    corecore