6 research outputs found

    Recursive algorithm, architectures and FPGA implementation of the two-dimensional discrete cosine transform

    Get PDF
    In this paper, a new recursive algorithm and two types of circuit architectures are presented for the computation of the two dimensional discrete cosine transform (2-D DCT). The new algorithm permits to compute the 2-D DCT by a simple procedure of the 1-D recursive calculations involving only cosine coefficients. The recursive kernel for the proposed algorithm contains a small number of operations. Also, it requires a smaller number of pre-computed data compared to many of existing algorithms in the same category. The kernel can be easily implemented in a simple circuit block with a short critical delay path. In order to evaluate the performance improvement resulting from the new algorithm, an architecture for the 2-D DCT designed by direct mapping from the computation structure of the proposed algorithm has been implemented in a FPGA board. The results show that the reduction of the hardware consumption can easily reach 25% and the clock frequency can increase 17% compared to a system implementing a recently reported 2-D DCT recursive algorithm. For a further reduction of the hardware, another architecture has been proposed for the same 2-D DCT computation. Using one recursive computation block to perform different functions, this architecture needs only approximately one half of the hardware that is required in the first architecture, which has been confirmed by a FPGA implementation

    Reconfigurable Computing For Video Coding

    Get PDF
    Video coding is widely used in our daily life. Due to its high computational complexity, hardware implementation is usually preferred. In this research, we investigate both ASIC hardware design approach and reconfigurable hardware design approach for video coding applications. First, we present a unified architecture that can perform Discrete Cosine Transform (DCT), Inverse Discrete Cosine Transform (IDCT), DCT domain motion estimation and compensation (DCT-ME/MC). Our proposed architecture is a Wavefront Array-based Processor with a highly modular structure consisting of 8*8 Processing Elements (PEs). By utilizing statistical properties and arithmetic operations, it can be used as a high performance hardware accelerator for video transcoding applications. We show how different core algorithms can be mapped onto the same hardware fabric and can be executed through the pre-defined PEs. In addition to the simplified design process of the proposed architecture and savings of the hardware resources, we also demonstrate that high throughput rate can be achieved for IDCT and DCT-MC by fully utilizing the sparseness property of DCT coefficient matrix. Compared to fixed hardware architecture using ASIC design approach, reconfigurable hardware design approach has higher flexibility, lower cost, and faster time-to-market. We propose a self-reconfigurable platform which can reconfigure the architecture of DCT computations during run-time using dynamic partial reconfiguration. The scalable architecture for DCT computations can compute different number of DCT coefficients in the zig-zag scan order to adapt to different requirements, such as power consumption, hardware resource, and performance. We propose a configuration manager which is implemented in the embedded processor in order to adaptively control the reconfiguration of scalable DCT architecture during run-time. In addition, we use LZSS algorithm for compression of the partial bitstreams and on-chip BlockRAM as a cache to reduce latency overhead for loading the partial bitstreams from the off-chip memory for run-time reconfiguration. A hardware module is designed for parallel reconfiguration of the partial bitstreams. The experimental results show that our approach can reduce the external memory accesses by 69% and can achieve 400 MBytes/s reconfiguration rate. Detailed trade-offs of power, throughput, and quality are investigated, and used as a criterion for self-reconfiguration. Prediction algorithm of zero quantized DCT (ZQDCT) to control the run-time reconfiguration of the proposed scalable architecture has been used, and 12 different modes of DCT computations including zonal coding, multi-block processing, and parallel-sequential stage modes are supported to reduce power consumptions, required hardware resources, and computation time with a small quality degradation. Detailed trade-offs of power, throughput, and quality are investigated, and used as a criterion for self-reconfiguration to meet the requirements set by the users

    Practical Techniques for Improving Performance and Evaluating Security on Circuit Designs

    Get PDF
    As the modern semiconductor technology approaches to nanometer era, integrated circuits (ICs) are facing more and more challenges in meeting performance demand and security. With the expansion of markets in mobile and consumer electronics, the increasing demands require much faster delivery of reliable and secure IC products. In order to improve the performance and evaluate the security of emerging circuits, we present three practical techniques on approximate computing, split manufacturing and analog layout automation. Approximate computing is a promising approach for low-power IC design. Although a few accuracy-configurable adder (ACA) designs have been developed in the past, these designs tend to incur large area overheads as they rely on either redundant computing or complicated carry prediction. We investigate a simple ACA design that contains no redundancy or error detection/correction circuitry and uses very simple carry prediction. The simulation results show that our design dominates the latest previous work on accuracy-delay-power tradeoff while using 39% less area. One variant of this design provides finer-grained and larger tunability than that of the previous works. Moreover, we propose a delay-adaptive self-configuration technique to further improve the accuracy-delay-power tradeoff. Split manufacturing prevents attacks from an untrusted foundry. The untrusted foundry has front-end-of-line (FEOL) layout and the original circuit netlist and attempts to identify critical components on the layout for Trojan insertion. Although defense methods for this scenario have been developed, the corresponding attack technique is not well explored. Hence, the defense methods are mostly evaluated with the k-security metric without actual attacks. We develop a new attack technique based on structural pattern matching. Experimental comparison with existing attack shows that the new attack technique achieves about the same success rate with much faster speed for cases without the k-security defense, and has a much better success rate at the same runtime for cases with the k-security defense. The results offer an alternative and practical interpretation for k-security in split manufacturing. Analog layout automation is still far behind its digital counterpart. We develop the layout automation framework for analog/mixed-signal ICs. A hierarchical layout synthesis flow which works in bottom-up manner is presented. To ensure the qualified layouts for better circuit performance, we use the constraint-driven placement and routing methodology which employs the expert knowledge via design constraints. The constraint-driven placement uses simulated annealing process to find the optimal solution. The packing represented by sequence pairs and constraint graphs can simultaneously handle different kinds of placement constraints. The constraint-driven routing consists of two stages, integer linear programming (ILP) based global routing and sequential detailed routing. The experiment results demonstrate that our flow can handle complicated hierarchical designs with multiple design constraints. Furthermore, the placement performance can be further improved by using mixed-size block placement which works on large blocks in priority

    Efficient VLSI Architectures for Image Compression Algorithms

    Get PDF
    An image, in its original form, contains huge amount of data which demands not only large amount of memory requirements for its storage but also causes inconvenient transmission over limited bandwidth channel. Image compression reduces the data from the image in either lossless or lossy way. While lossless image compression retrieves the original image data completely, it provides very low compression. Lossy compression techniques compress the image data in variable amount depending on the quality of image required for its use in particular application area. It is performed in steps such as image transformation, quantization and entropy coding. JPEG is one of the most used image compression standard which uses discrete cosine transform (DCT) to transform the image from spatial to frequency domain. An image contains low visual information in its high frequencies for which heavy quantization can be done in order to reduce the size in the transformed representation. Entropy coding follows to further reduce the redundancy in the transformed and quantized image data. Real-time data processing requires high speed which makes dedicated hardware implementation most preferred choice. The hardware of a system is favored by its lowcost and low-power implementation. These two factors are also the most important requirements for the portable devices running on battery such as digital camera. Image transform requires very high computations and complete image compression system is realized through various intermediate steps between transform and final bit-streams. Intermediate stages require memory to store intermediate results. The cost and power of the design can be reduced both in efficient implementation of transforms and reduction/removal of intermediate stages by employing different techniques. The proposed research work is focused on the efficient hardware implementation of transform based image compression algorithms by optimizing the architecture of the system. Distribute arithmetic (DA) is an efficient approach to implement digital signal processing algorithms. DA is realized by two different ways, one through storage of precomputed values in ROMs and another without ROM requirements. ROM free DA is more efficient. For the image transform, architectures of one dimensional discrete Hartley transform (1-D DHT) and one dimensional DCT (1-D DCT) have been optimized using ROM free DA technique. Further, 2-D separable DHT (SDHT) and 2-D DCT architectures have been implemented in row-column approach using two 1-D DHT and two 1-D DCT respectively. A finite state machine (FSM) based architecture from DCT to quantization has been proposed using the modified quantization matrix in JPEG image compression which requires no memory in storage of quantization table and DCT coefficients. In addition, quantization is realized without use of multipliers that require more area and are power hungry. For the entropy encoding, Huffman coding is hardware efficient than arithmetic coding. The use of Huffman code table further simplifies the implementation. The strategies have been used for the significant reduction of memory bits in storage of Huffman code table and the complete Huffman coding architecture encodes the transformed coefficients one bit per clock cycle. Direct implementation algorithm of DCT has the advantage that it is free of transposition memory to store intermediate 1-D DCT. Although recursive algorithms have been a preferred method, these algorithms have low accuracy resulting in image quality degradation. A non-recursive equation for the direct computation of DCT coefficients have been proposed and implemented in both 0.18 µm ASIC library as well as FPGA. It can compute DCT coefficients in any order and all intermediate computations are free of fractions and hence very high image quality has been obtained in terms of PSNR. In addition, one multiplier and one register bit-width need to be changed for increasing the accuracy resulting in very low hardware overhead. The architecture implementation has been done to obtain zig-zag ordered DCT coefficients. The comparison results show that this implementation has less area in terms of gate counts and less power consumption than the existing DCT implementations. Using this architecture, the complete JPEG image compression system has been implemented which has Huffman coding module, one multiplier and one register as the only additional modules. The intermediate stages (DCT to Huffman encoding) are free of memory, hence efficient architecture is obtained

    Practical Techniques for Improving Performance and Evaluating Security on Circuit Designs

    Get PDF
    As the modern semiconductor technology approaches to nanometer era, integrated circuits (ICs) are facing more and more challenges in meeting performance demand and security. With the expansion of markets in mobile and consumer electronics, the increasing demands require much faster delivery of reliable and secure IC products. In order to improve the performance and evaluate the security of emerging circuits, we present three practical techniques on approximate computing, split manufacturing and analog layout automation. Approximate computing is a promising approach for low-power IC design. Although a few accuracy-configurable adder (ACA) designs have been developed in the past, these designs tend to incur large area overheads as they rely on either redundant computing or complicated carry prediction. We investigate a simple ACA design that contains no redundancy or error detection/correction circuitry and uses very simple carry prediction. The simulation results show that our design dominates the latest previous work on accuracy-delay-power tradeoff while using 39% less area. One variant of this design provides finer-grained and larger tunability than that of the previous works. Moreover, we propose a delay-adaptive self-configuration technique to further improve the accuracy-delay-power tradeoff. Split manufacturing prevents attacks from an untrusted foundry. The untrusted foundry has front-end-of-line (FEOL) layout and the original circuit netlist and attempts to identify critical components on the layout for Trojan insertion. Although defense methods for this scenario have been developed, the corresponding attack technique is not well explored. Hence, the defense methods are mostly evaluated with the k-security metric without actual attacks. We develop a new attack technique based on structural pattern matching. Experimental comparison with existing attack shows that the new attack technique achieves about the same success rate with much faster speed for cases without the k-security defense, and has a much better success rate at the same runtime for cases with the k-security defense. The results offer an alternative and practical interpretation for k-security in split manufacturing. Analog layout automation is still far behind its digital counterpart. We develop the layout automation framework for analog/mixed-signal ICs. A hierarchical layout synthesis flow which works in bottom-up manner is presented. To ensure the qualified layouts for better circuit performance, we use the constraint-driven placement and routing methodology which employs the expert knowledge via design constraints. The constraint-driven placement uses simulated annealing process to find the optimal solution. The packing represented by sequence pairs and constraint graphs can simultaneously handle different kinds of placement constraints. The constraint-driven routing consists of two stages, integer linear programming (ILP) based global routing and sequential detailed routing. The experiment results demonstrate that our flow can handle complicated hierarchical designs with multiple design constraints. Furthermore, the placement performance can be further improved by using mixed-size block placement which works on large blocks in priority
    corecore