502 research outputs found

    Approximate and timing-speculative hardware design for high-performance and energy-efficient video processing

    Get PDF
    Since the end of transistor scaling in 2-D appeared on the horizon, innovative circuit design paradigms have been on the rise to go beyond the well-established and ultraconservative exact computing. Many compute-intensive applications – such as video processing – exhibit an intrinsic error resilience and do not necessarily require perfect accuracy in their numerical operations. Approximate computing (AxC) is emerging as a design alternative to improve the performance and energy-efficiency requirements for many applications by trading its intrinsic error tolerance with algorithm and circuit efficiency. Exact computing also imposes a worst-case timing to the conventional design of hardware accelerators to ensure reliability, leading to an efficiency loss. Conversely, the timing-speculative (TS) hardware design paradigm allows increasing the frequency or decreasing the voltage beyond the limits determined by static timing analysis (STA), thereby narrowing pessimistic safety margins that conventional design methods implement to prevent hardware timing errors. Timing errors should be evaluated by an accurate gate-level simulation, but a significant gap remains: How these timing errors propagate from the underlying hardware all the way up to the entire algorithm behavior, where they just may degrade the performance and quality of service of the application at stake? This thesis tackles this issue by developing and demonstrating a cross-layer framework capable of performing investigations of both AxC (i.e., from approximate arithmetic operators, approximate synthesis, gate-level pruning) and TS hardware design (i.e., from voltage over-scaling, frequency over-clocking, temperature rising, and device aging). The cross-layer framework can simulate both timing errors and logic errors at the gate-level by crossing them dynamically, linking the hardware result with the algorithm-level, and vice versa during the evolution of the application’s runtime. Existing frameworks perform investigations of AxC and TS techniques at circuit-level (i.e., at the output of the accelerator) agnostic to the ultimate impact at the application level (i.e., where the impact is truly manifested), leading to less optimization. Unlike state of the art, the framework proposed offers a holistic approach to assessing the tradeoff of AxC and TS techniques at the application-level. This framework maximizes energy efficiency and performance by identifying the maximum approximation levels at the application level to fulfill the required good enough quality. This thesis evaluates the framework with an 8-way SAD (Sum of Absolute Differences) hardware accelerator operating into an HEVC encoder as a case study. Application-level results showed that the SAD based on the approximate adders achieve savings of up to 45% of energy/operation with an increase of only 1.9% in BD-BR. On the other hand, VOS (Voltage Over-Scaling) applied to the SAD generates savings of up to 16.5% in energy/operation with around 6% of increase in BD-BR. The framework also reveals that the boost of about 6.96% (at 50°) to 17.41% (at 75° with 10- Y aging) in the maximum clock frequency achieved with TS hardware design is totally lost by the processing overhead from 8.06% to 46.96% when choosing an unreliable algorithm to the blocking match algorithm (BMA). We also show that the overhead can be avoided by adopting a reliable BMA. This thesis also shows approximate DTT (Discrete Tchebichef Transform) hardware proposals by exploring a transform matrix approximation, truncation and pruning. The results show that the approximate DTT hardware proposal increases the maximum frequency up to 64%, minimizes the circuit area in up to 43.6%, and saves up to 65.4% in power dissipation. The DTT proposal mapped for FPGA shows an increase of up to 58.9% on the maximum frequency and savings of about 28.7% and 32.2% on slices and dynamic power, respectively compared with stat

    Software-based Approximate Computation Of Signal Processing Tasks

    Get PDF
    This thesis introduces a new dimension in performance scaling of signal processing systems by proposing software frameworks that achieve increased processing throughput when producing approximate results. The first contribution of this work is a new theory for accelerated computation of multimedia processing based on the concept of tight packing (Chapter 2). Usage of this theory accelerates small-dynamic-range linear signal processing tasks (such as convolution and transform decomposition) that map integers to integers, without incurring any accuracy loss. The concept of tight packing is combined with incremental computation that processes inputs in a bitplane-by-bitplane manner (Chapter 3), thereby leading to substantial throughput/distortion scalability within filtering, transform-decomposition and motion-estimation tasks. This framework also provides for region-of-interest computation and has inherent robustness to arbitrary termination of processing, imposed, for example, by a task scheduler. Finally, the concept of packed processing is extended to floating-point (lossy) matrix computations, with particular focus on the generic matrix multiplication (GEMM) routine of BLAS-3 (Chapters 4 and 5). This routine is a fundamental building block for several linear algebra and digital signal processing systems, such as face recognition and neural-network training for metadata-based retrieval systems. In order to compete with the best-performing software designs for GEMM, an implementation using single instruction, multiple data (SIMD) instructions is presented and analyzed. The proposed approach demonstrates substantial performance scaling in practice; specifically, it is shown to achieve up to twice the processing throughput of the best designs for GEMM when producing approximate results (under the same hardware). In summary, the proposed approximate computation of signal processing tasks can be selectively disabled thereby producing conventional full-precision/lower-throughput processing when deemed necessary. Importantly, the proposed software designs run on off-the-shelf computer hardware and provide for on-demand reconfiguration, depending on the input data and the precision specification (from full precision to noisy computation). Thus, the proposed approximate computation framework allows for backward compatibility and can be offered as an add-on service, creating significant competitive advantages for application developers. It can be used in mobile or high-performance computing systems when the precision of computation is not of critical importance (error-tolerant systems), or when the input data is intrinsically noisy

    Quantification and segmentation of breast cancer diagnosis: efficient hardware accelerator approach

    Get PDF
    The mammography image eccentric area is the breast density percentage measurement. The technical challenge of quantification in radiology leads to misinterpretation in screening. Data feedback from society, institutional, and industry shows that quantification and segmentation frameworks have rapidly become the primary methodologies for structuring and interpreting mammogram digital images. Segmentation clustering algorithms have setbacks on overlapping clusters, proportion, and multidimensional scaling to map and leverage the data. In combination, mammogram quantification creates a long-standing focus area. The algorithm proposed must reduce complexity and target data points distributed in iterative, and boost cluster centroid merged into a single updating process to evade the large storage requirement. The mammogram database's initial test segment is critical for evaluating performance and determining the Area Under the Curve (AUC) to alias with medical policy. In addition, a new image clustering algorithm anticipates the need for largescale serial and parallel processing. There is no solution on the market, and it is necessary to implement communication protocols between devices. Exploiting and targeting utilization hardware tasks will further extend the prospect of improvement in the cluster. Benchmarking their resources and performance is required. Finally, the medical imperatives cluster was objectively validated using qualitative and quantitative inspection. The proposed method should overcome the technical challenges that radiologists face

    Design of large polyphase filters in the Quadratic Residue Number System

    Full text link

    Study and development of innovative strategies for energy-efficient cross-layer design of digital VLSI systems based on Approximate Computing

    Get PDF
    The increasing demand on requirements for high performance and energy efficiency in modern digital systems has led to the research of new design approaches that are able to go beyond the established energy-performance tradeoff. Looking at scientific literature, the Approximate Computing paradigm has been particularly prolific. Many applications in the domain of signal processing, multimedia, computer vision, machine learning are known to be particularly resilient to errors occurring on their input data and during computation, producing outputs that, although degraded, are still largely acceptable from the point of view of quality. The Approximate Computing design paradigm leverages the characteristics of this group of applications to develop circuits, architectures, algorithms that, by relaxing design constraints, perform their computations in an approximate or inexact manner reducing energy consumption. This PhD research aims to explore the design of hardware/software architectures based on Approximate Computing techniques, filling the gap in literature regarding effective applicability and deriving a systematic methodology to characterize its benefits and tradeoffs. The main contributions of this work are: -the introduction of approximate memory management inside the Linux OS, allowing dynamic allocation and de-allocation of approximate memory at user level, as for normal exact memory; - the development of an emulation environment for platforms with approximate memory units, where faults are injected during the simulation based on models that reproduce the effects on memory cells of circuital and architectural techniques for approximate memories; -the implementation and analysis of the impact of approximate memory hardware on real applications: the H.264 video encoder, internally modified to allocate selected data buffers in approximate memory, and signal processing applications (digital filter) using approximate memory for input/output buffers and tap registers; -the development of a fully reconfigurable and combinatorial floating point unit, which can work with reduced precision formats

    Discrete Wavelet Transforms

    Get PDF
    The discrete wavelet transform (DWT) algorithms have a firm position in processing of signals in several areas of research and industry. As DWT provides both octave-scale frequency and spatial timing of the analyzed signal, it is constantly used to solve and treat more and more advanced problems. The present book: Discrete Wavelet Transforms: Algorithms and Applications reviews the recent progress in discrete wavelet transform algorithms and applications. The book covers a wide range of methods (e.g. lifting, shift invariance, multi-scale analysis) for constructing DWTs. The book chapters are organized into four major parts. Part I describes the progress in hardware implementations of the DWT algorithms. Applications include multitone modulation for ADSL and equalization techniques, a scalable architecture for FPGA-implementation, lifting based algorithm for VLSI implementation, comparison between DWT and FFT based OFDM and modified SPIHT codec. Part II addresses image processing algorithms such as multiresolution approach for edge detection, low bit rate image compression, low complexity implementation of CQF wavelets and compression of multi-component images. Part III focuses watermaking DWT algorithms. Finally, Part IV describes shift invariant DWTs, DC lossless property, DWT based analysis and estimation of colored noise and an application of the wavelet Galerkin method. The chapters of the present book consist of both tutorial and highly advanced material. Therefore, the book is intended to be a reference text for graduate students and researchers to obtain state-of-the-art knowledge on specific applications

    Temperature aware power optimization for multicore floating-point units

    Full text link

    Neuromorphic Engineering Editors' Pick 2021

    Get PDF
    This collection showcases well-received spontaneous articles from the past couple of years, which have been specially handpicked by our Chief Editors, Profs. André van Schaik and Bernabé Linares-Barranco. The work presented here highlights the broad diversity of research performed across the section and aims to put a spotlight on the main areas of interest. All research presented here displays strong advances in theory, experiment, and methodology with applications to compelling problems. This collection aims to further support Frontiers’ strong community by recognizing highly deserving authors

    Efficient Algorithms for Large-Scale Image Analysis

    Get PDF
    This work develops highly efficient algorithms for analyzing large images. Applications include object-based change detection and screening. The algorithms are 10-100 times as fast as existing software, sometimes even outperforming FGPA/GPU hardware, because they are designed to suit the computer architecture. This thesis describes the implementation details and the underlying algorithm engineering methodology, so that both may also be applied to other applications
    • …
    corecore