Search CORE

80 research outputs found

Power-Aware Design Methodologies for FPGA-Based Implementation of Video Processing Systems

Author: Ngo Hau Trung
Publication venue: ODU Digital Commons
Publication date: 01/01/2007
Field of study

The increasing capacity and capabilities of FPGA devices in recent years provide an attractive option for performance-hungry applications in the image and video processing domain. FPGA devices are often used as implementation platforms for image and video processing algorithms for real-time applications due to their programmable structure that can exploit inherent spatial and temporal parallelism. While performance and area remain as two main design criteria, power consumption has become an important design goal especially for mobile devices. Reduction in power consumption can be achieved by reducing the supply voltage, capacitances, clock frequency and switching activities in a circuit. Switching activities can be reduced by architectural optimization of the processing cores such as adders, multipliers, multiply and accumulators (MACS), etc. This dissertation research focuses on reducing the switching activities in digital circuits by considering data dependencies in bit level, word level and block level neighborhoods in a video frame. The bit level data neighborhood dependency consideration for power reduction is illustrated in the design of pipelined array, Booth and log-based multipliers. For an array multiplier, operands of the multipliers are partitioned into higher and lower parts so that the probability of the higher order parts being zero or one increases. The gating technique for the pipelined approach deactivates part(s) of the multiplier when the above special values are detected. For the Booth multiplier, the partitioning and gating technique is integrated into the Booth recoding scheme. In addition, a delay correction strategy is developed for the Booth multiplier to reduce the switching activities of the sign extension part in the partial products. A novel architecture design for the computation of log and inverse-log functions for the reduction of power consumption in arithmetic circuits is also presented. This also utilizes the proposed partitioning and gating technique for further dynamic power reduction by reducing the switching activities. The word level and block level data dependencies for reducing the dynamic power consumption are illustrated by presenting the design of a 2-D convolution architecture. Here the similarities of the neighboring pixels in window-based operations of image and video processing algorithms are considered for reduced switching activities. A partitioning and detection mechanism is developed to deactivate the parallel architecture for window-based operations if higher order parts of the pixel values are the same. A neighborhood dependent approach (NDA) is incorporated with different window buffering schemes. Consideration of the symmetry property in filter kernels is also applied with the NDA method for further reduction of switching activities. The proposed design methodologies are implemented and evaluated in a FPGA environment. It is observed that the dynamic power consumption in FPGA-based circuit implementations is significantly reduced in bit level, data level and block level architectures when compared to state-of-the-art design techniques. A specific application for the design of a real-time video processing system incorporating the proposed design methodologies for low power consumption is also presented. An image enhancement application is considered and the proposed partitioning and gating, and NDA methods are utilized in the design of the enhancement system. Experimental results show that the proposed multi-level power aware methodology achieves considerable power reduction. Research work is progressing In utilizing the data dependencies in subsequent frames in a video stream for the reduction of circuit switching activities and thereby the dynamic power consumption

Old Dominion University

Recommended from our members

AN ARCHITECTURE EVALUATION AND IMPLEMENTATION OF A SOFT GPGPU FOR FPGAs

Author: Andryc Kevin
Publication venue: ScholarWorks@UMass Amherst
Publication date: 25/10/2018
Field of study

Embedded and mobile systems must be able to execute a variety of different types of code, often with minimal available hardware. Many embedded systems now come with a simple processor and an FPGA, but not more energy-hungry components, such as a GPGPU. In this dissertation we present FlexGrip, a soft architecture which allows for the execution of GPGPU code on an FPGA without the need to recompile the design. The architecture is optimized for FPGA implementation to effectively support the conditional and thread-based execution characteristics of GPGPU execution without FPGA design recompilation. This architecture supports direct CUDA compilation to a binary which is executable on the FPGA-based GPGPU. Our architecture is customizable, thus providing the FPGA designer with a selection of GPGPU cores which display performance versus area tradeoffs. This dissertation describes the FlexGrip architecture in detail and showcases the benefits by evaluating the design for a collection of five standard CUDA benchmarks which are compiled using standard GPGPU compilation tools. Speedups of 23x, on average, versus a MicroBlaze microprocessor are achieved for designs which take advantage of the conditional execution capabilities offered by FlexGrip. We also show FlexGrip can achieve an 80% average reduction of dynamic energy versus the MicroBlaze microprocessor. The dissertation furthers discussion by exploring application-customized versions of the soft GPGPU, thus exploiting the overlay architecture. We expand the architecture to multiple processors per GPGPU and optimizing away features which are not needed by certain classes of applications. These optimizations, which include the effective use of block RAMs and DSP blocks, are critical to the performance of FlexGrip. By implementing a 2 GPGPU design, we show speedups of 44x on average versus a MicroBlaze microprocessor. Application-customized versions of the soft GPGPU can be used to further reduce dynamic energy consumption by an average of 14%. To complete this thesis, we augmented a GPGPU cycle accurate simulator to emulate FlexGrip and evaluate different levels of cache design spaces. We show performance increases for select benchmarks, however, we also show that 64% and 45% of benchmarks exhibited performance decreases when L1D cache was enabled for the 1 SMP and 2 SMP configurations, and only one benchmark showed performance improvement when the L2 cache was enabled

ScholarWorks@UMass Amherst

Highly Automated Formal Verification of Arithmetic Circuits

Author: Sayed-Ahmed Amr
Publication venue
Publication date: 01/01/2017
Field of study

This dissertation investigates the problems of two distinctive formal verification techniques for verifying large scale multiplier circuits and proposes two approaches to overcome some of these problems. The first technique is equivalence checking based on recurrence relations, while the second one is the symbolic computation technique which is based on the theory of Gröbner bases. This investigation demonstrates that approaches based on symbolic computation have better scalability and more robustness than state-of-the-art equivalence checking techniques for verification of arithmetic circuits. According to this conclusion, the thesis leverages the symbolic computation technique to verify floating-point designs. It proposes a new algebraic equivalence checking, in contrast to classical combinational equivalence checking, the proposed technique is capable of checking the equivalence of two circuits which have different architectures of arithmetic units as well as control logic parts, e.g., floating-point multipliers

E-LIB Dokumentserver - Staats und Universitätsbibliothek Bremen

A Solder-Defined Computer Architecture for Backdoor and Malware Resistance

Author: Abel Marc W.
Publication venue: CORE Scholar
Publication date: 01/01/2022
Field of study

This research is about securing control of those devices we most depend on for integrity and confidentiality. An emerging concern is that complex integrated circuits may be subject to exploitable defects or backdoors, and measures for inspection and audit of these chips are neither supported nor scalable. One approach for providing a “supply chain firewall” may be to forgo such components, and instead to build central processing units (CPUs) and other complex logic from simple, generic parts. This work investigates the capability and speed ceiling when open-source hardware methodologies are fused with maker-scale assembly tools and visible-scale final inspection. The author has designed, and demonstrated in simulation, a 36-bit CPU and protected memory subsystem that use only synchronous static random access memory (SRAM) and trivial glue logic integrated circuits as components. The design presently lacks preemptive multitasking, ability to load firmware into the SRAMs used as logic elements, and input/output. Strategies are presented for adding these missing subsystems, again using only SRAM and trivial glue logic. A load-store architecture is employed with four clock cycles per instruction. Simulations indicate that a clock speed of at least 64 MHz is probable, corresponding to 16 million instructions per second (16 MIPS), despite the architecture containing no microprocessors, field programmable gate arrays, programmable logic devices, application specific integrated circuits, or other purchased complex logic. The lower speed, larger size, higher power consumption, and higher cost of an “SRAM minicomputer,” compared to traditional microcontrollers, may be offset by the fully open architecture—hardware and firmware—along with more rigorous user control, reliability, transparency, and auditability of the system. SRAM logic is also particularly well suited for building arithmetic logic units, and can implement complex operations such as population count, a hash function for associative arrays, or a pseudorandom number generator with good statistical properties in as few as eight clock cycles per 36-bit word processed. 36-bit unsigned multiplication can be implemented in software in 47 instructions or fewer (188 clock cycles). A general theory is developed for fast SRAM parallel multipliers should they be needed

CORE

Performance Aspects of Synthesizable Computing Systems

Author: Schleuniger Pascal
Publication venue: Technical University of Denmark
Publication date: 01/01/2014
Field of study

Online Research Database In Technology

Fast Fourier transforms on energy-efficient application-specific processors

Author: Pitkänen Teemu
Publication venue: Tampere University of Technology
Publication date: 01/01/2014
Field of study

Many of the current applications used in battery powered devices are from digital signal processing, telecommunication, and multimedia domains. Traditionally application-specific fixed-function circuits have been used in these designs in form of application-specific integrated circuits (ASIC) to reach the required performance and energy-efficiency. The complexity of these applications has increased over the years, thus the design complexity has increased even faster, which implies increased design time. At the same time, there are more and more standards to be supported, thus using optimised fixed-function implementations for all the functions in all the standards is impractical. The non-recurring engineering costs for integrated circuits have also increased significantly, so manufacturers can only afford fewer chip iterations. Although tailoring the circuit for a specific application provides the best performance and/or energy-efficiency, such approach lacks flexibility. E.g., if an error is found after the manufacturing, an expensive chip iteration is required. In addition, new functionalities cannot be added afterwards to support evolution of standards. Flexibility can be obtained with software based implementation technologies. Unfortunately, general-purpose processors do not provide the energy-efficiency of the fixed-function circuit designs. A useful trade-off between flexibility and performance is implementation based on application-specific processors (ASP) where programmability provides the flexibility and computational resources customised for the given application provide the performance. In this Thesis, application-specific processors are considered by using fast Fourier transform as the representative algorithm. The architectural template used here is transport triggered architecture (TTA) which resembles very long instruction word machines but the operand execution resembles data flow machines rather than traditional operand triggering. The developed TTA processors exploit inherent parallelism of the application. In addition, several characteristics of the application have been identified and those are exploited by developing customised functional units for speeding up the execution. Several customisations are proposed for the data path of the processor but it is also important to match the memory bandwidth to the computation speed. This calls for a memory organisation supporting parallel memory accesses. The proposed optimisations have been used to improve the energy-efficiency of the processor and experiments show that a programmable solution can have energy-efficiency comparable to fixed-function ASIC designs

Trepo - Institutional Repository of Tampere University

Energy efficient multi-granular arithmetic in a coarse-grain reconfigurable architecture

Author: Louwers S.T.
Publication venue
Publication date: 01/01/2016
Field of study

Repository TU/e

Pure OAI Repository

Energy efficient hardware acceleration of multimedia processing tools

Author: Kinane Andrew
Publication venue: Dublin City University. School of Electronic Engineering
Publication date: 01/01/2006
Field of study

The world of mobile devices is experiencing an ongoing trend of feature enhancement and generalpurpose multimedia platform convergence. This trend poses many grand challenges, the most pressing being their limited battery life as a consequence of delivering computationally demanding features. The envisaged mobile application features can be considered to be accelerated by a set of underpinning hardware blocks Based on the survey that this thesis presents on modem video compression standards and their associated enabling technologies, it is concluded that tight energy and throughput constraints can still be effectively tackled at algorithmic level in order to design re-usable optimised hardware acceleration cores. To prove these conclusions, the work m this thesis is focused on two of the basic enabling technologies that support mobile video applications, namely the Shape Adaptive Discrete Cosine Transform (SA-DCT) and its inverse, the SA-IDCT. The hardware architectures presented in this work have been designed with energy efficiency in mind. This goal is achieved by employing high level techniques such as redundant computation elimination, parallelism and low switching computation structures. Both architectures compare favourably against the relevant pnor art in the literature. The SA-DCT/IDCT technologies are instances of a more general computation - namely, both are Constant Matrix Multiplication (CMM) operations. Thus, this thesis also proposes an algorithm for the efficient hardware design of any general CMM-based enabling technology. The proposed algorithm leverages the effective solution search capability of genetic programming. A bonus feature of the proposed modelling approach is that it is further amenable to hardware acceleration. Another bonus feature is an early exit mechanism that achieves large search space reductions .Results show an improvement on state of the art algorithms with future potential for even greater savings

Irish Universities

DCU Online Research Access Service

Formal verification of a fully IEEE compliant floating point unit

Author: Jacobi Christian
Publication venue
Publication date: 23/09/2004
Field of study

In this thesis we describe the formal verification of a fully IEEE compliant floating point unit (FPU). The hardware is verified on the gate-level against a formalization of the IEEE standard. The verification is performed using the theorem proving system PVS. The FPU supports both single and double precision floating point numbers, normal and denormal numbers, all four IEEE rounding modes, and exceptions as required by the standard. Beside the verification of the combinatorial correctness of the FPUs we pipeline the FPUs to allow the integration into an out-of-order processor. We formally define the correctness criterion the pipelines must obey in order to work properly within the processor. We then describe a new methodology based on combining model checking and theorem proving for the verification of the pipelines.Die vorliegende Arbeit behandelt die formale Verifikation einer vollständig IEEE konformen Floating Point Unit (FPU). Die Hardware wird auf Gatter-Ebene gegen eine Formalisierung des IEEE Standards verifiziert. Zur Verifikation wird das Beweis-System PVS benutzt. Die FPU unterstützt Fließkommazahlen mit einfacher und doppelter Genauigkeit, normale und denormale Zahlen, alle vier Rundungsmodi und alle Exception-Signale. Neben der Verifikation der kombinatorischen Schaltkreise werden die FPUs gepipelined, um sie in einen Out-of-order Prozessor zu integrieren. Die Korrektheits- Kriterien, die die gepipelineten FPUs befolgen müssen, um im Prozessor korrekt zu arbeiten, werden formal definiert. Es wird eine neue Methode zur Verifikation solcher Pipelines beschrieben. Die Methode beruht auf der Kombination von Model-Checking und Theorem-Proving

Universaar

Acronym