Search CORE

450 research outputs found

A Genetic-algorithm-based Approach to the Design of DCT Hardware Accelerators

Author: Barbareschi M.
Barone S.
Bosio A.
Han J.
Traiola M.
Publication venue
Publication date: 01/01/2022
Field of study

As modern applications demand an unprecedented level of computational resources, traditional computing system design paradigms are no longer adequate to guarantee significant performance enhancement at an affordable cost. Approximate Computing (AxC) has been introduced as a potential candidate to achieve better computational performances by relaxing non-critical functional system specifications. In this article, we propose a systematic and high-abstraction-level approach allowing the automatic generation of near Pareto-optimal approximate configurations for a Discrete Cosine Transform (DCT) hardware accelerator. We obtain the approximate variants by using approximate operations, having configurable approximation degree, rather than full-precise ones. We use a genetic searching algorithm to find the appropriate tuning of the approximation degree, leading to optimal tradeoffs between accuracy and gains. Finally, to evaluate the actual HW gains, we synthesize non-dominated approximate DCT variants for two different target technologies, namely, Field Programmable Gate Arrays (FPGAs) and Application Specific Integrated Circuits (ASICs). Experimental results show that the proposed approach allows performing a meaningful exploration of the design space to find the best tradeoffs in a reasonable time. Indeed, compared to the state-of-the-art work on approximate DCT, the proposed approach allows an 18% average energy improvement while providing at the same time image quality improvement

Archivio della ricerca - Università degli studi di Napoli Federico II

Approximate and timing-speculative hardware design for high-performance and energy-efficient video processing

Author: Paim Guilherme Pereira
Publication venue
Publication date: 01/01/2021
Field of study

Since the end of transistor scaling in 2-D appeared on the horizon, innovative circuit design paradigms have been on the rise to go beyond the well-established and ultraconservative exact computing. Many compute-intensive applications – such as video processing – exhibit an intrinsic error resilience and do not necessarily require perfect accuracy in their numerical operations. Approximate computing (AxC) is emerging as a design alternative to improve the performance and energy-efficiency requirements for many applications by trading its intrinsic error tolerance with algorithm and circuit efficiency. Exact computing also imposes a worst-case timing to the conventional design of hardware accelerators to ensure reliability, leading to an efficiency loss. Conversely, the timing-speculative (TS) hardware design paradigm allows increasing the frequency or decreasing the voltage beyond the limits determined by static timing analysis (STA), thereby narrowing pessimistic safety margins that conventional design methods implement to prevent hardware timing errors. Timing errors should be evaluated by an accurate gate-level simulation, but a significant gap remains: How these timing errors propagate from the underlying hardware all the way up to the entire algorithm behavior, where they just may degrade the performance and quality of service of the application at stake? This thesis tackles this issue by developing and demonstrating a cross-layer framework capable of performing investigations of both AxC (i.e., from approximate arithmetic operators, approximate synthesis, gate-level pruning) and TS hardware design (i.e., from voltage over-scaling, frequency over-clocking, temperature rising, and device aging). The cross-layer framework can simulate both timing errors and logic errors at the gate-level by crossing them dynamically, linking the hardware result with the algorithm-level, and vice versa during the evolution of the application’s runtime. Existing frameworks perform investigations of AxC and TS techniques at circuit-level (i.e., at the output of the accelerator) agnostic to the ultimate impact at the application level (i.e., where the impact is truly manifested), leading to less optimization. Unlike state of the art, the framework proposed offers a holistic approach to assessing the tradeoff of AxC and TS techniques at the application-level. This framework maximizes energy efficiency and performance by identifying the maximum approximation levels at the application level to fulfill the required good enough quality. This thesis evaluates the framework with an 8-way SAD (Sum of Absolute Differences) hardware accelerator operating into an HEVC encoder as a case study. Application-level results showed that the SAD based on the approximate adders achieve savings of up to 45% of energy/operation with an increase of only 1.9% in BD-BR. On the other hand, VOS (Voltage Over-Scaling) applied to the SAD generates savings of up to 16.5% in energy/operation with around 6% of increase in BD-BR. The framework also reveals that the boost of about 6.96% (at 50°) to 17.41% (at 75° with 10- Y aging) in the maximum clock frequency achieved with TS hardware design is totally lost by the processing overhead from 8.06% to 46.96% when choosing an unreliable algorithm to the blocking match algorithm (BMA). We also show that the overhead can be avoided by adopting a reliable BMA. This thesis also shows approximate DTT (Discrete Tchebichef Transform) hardware proposals by exploring a transform matrix approximation, truncation and pruning. The results show that the approximate DTT hardware proposal increases the maximum frequency up to 64%, minimizes the circuit area in up to 43.6%, and saves up to 65.4% in power dissipation. The DTT proposal mapped for FPGA shows an increase of up to 58.9% on the maximum frequency and savings of about 28.7% and 32.2% on slices and dynamic power, respectively compared with stat

Lume 5.8

Bi-criteria Pipeline Mappings for Parallel Image Processing

Author: Anne Benoit
Anne Benoit
Harald Kosch
Harald Kosch
Supérieure Lyon
Supérieure Lyon
Veronika Rehn-sonigo
Veronika Rehn-sonigo
Yves Robert
Yves Robert
École Normale
École Normale
Publication venue
Publication date: 01/01/2008
Field of study

Mapping workflow applications onto parallel platforms is a challenging problem, even for simple application patterns such as pipeline graphs. Several antagonistic criteria should be optimized, such as throughput and latency (or a combination). Typical applications include digital image processing, where images are processed in steady-state mode. In this paper, we study the mapping of a particular image processing application, the JPEG encoding. Mapping pipelined JPEG encoding onto parallel platforms is useful for instance for encoding Motion JPEG images. As the bi-criteria mapping problem is NP-complete, we concentrate on the evaluation and performance of polynomial heuristics

arXiv.org e-Print Archive

HAL-ENS-LYON

CiteSeerX

Crossref

INRIA a CCSD electronic archive server

Hal-Diderot

Precision-Energy-Throughput Scaling Of Generic Matrix Multiplication and Convolution Kernels Via Linear Projections

Author: Anam Mohammad Ashraful
Andreopoulos Yiannis
Whatmough Paul N.
Publication venue
Publication date: 01/01/2013
Field of study

Generic matrix multiplication (GEMM) and one-dimensional convolution/cross-correlation (CONV) kernels often constitute the bulk of the compute- and memory-intensive processing within image/audio recognition and matching systems. We propose a novel method to scale the energy and processing throughput of GEMM and CONV kernels for such error-tolerant multimedia applications by adjusting the precision of computation. Our technique employs linear projections to the input matrix or signal data during the top-level GEMM and CONV blocking and reordering. The GEMM and CONV kernel processing then uses the projected inputs and the results are accumulated to form the final outputs. Throughput and energy scaling takes place by changing the number of projections computed by each kernel, which in turn produces approximate results, i.e. changes the precision of the performed computation. Results derived from a voltage- and frequency-scaled ARM Cortex A15 processor running face recognition and music matching algorithms demonstrate that the proposed approach allows for 280%~440% increase of processing throughput and 75%~80% decrease of energy consumption against optimized GEMM and CONV kernels without any impact in the obtained recognition or matching accuracy. Even higher gains can be obtained if one is willing to tolerate some reduction in the accuracy of the recognition and matching applications

arXiv.org e-Print Archive

Crossref

UCL Discovery

1994 Science Information Management and Data Compression Workshop

Author: Tilton James C.
Publication venue
Publication date
Field of study

This document is the proceedings from the 'Science Information Management and Data Compression Workshop,' which was held on September 26-27, 1994, at the NASA Goddard Space Flight Center, Greenbelt, Maryland. The Workshop explored promising computational approaches for handling the collection, ingestion, archival and retrieval of large quantities of data in future Earth and space science missions. It consisted of eleven presentations covering a range of information management and data compression approaches that are being or have been integrated into actual or prototypical Earth or space science data information systems, or that hold promise for such an application. The workshop was organized by James C. Tilton and Robert F. Cromp of the NASA Goddard Space Flight Center

NASA Technical Reports Server

Error tolerant multimedia stream processing: There's plenty of room at the top (of the system stack)

Author: Andreopoulos Y
Publication venue
Publication date: 08/12/2012
Field of study

There is a growing realization that the expected fault rates and energy dissipation stemming from increases in CMOS integration will lead to the abandonment of traditional system reliability in favor of approaches that offer reliability to hardware-induced errors across the application, runtime support, architecture, device and integrated-circuit (IC) layers. Commercial stakeholders of multimedia stream processing (MSP) applications, such as information retrieval, stream mining systems, and high-throughput image and video processing systems already feel the strain of inadequate system-level scaling and robustness under the always-increasing user demand. While such applications can tolerate certain imprecision in their results, today's MSP systems do not support a systematic way to exploit this aspect for cross-layer system resilience. However, research is currently emerging that attempts to utilize the error-tolerant nature of MSP applications for this purpose. This is achieved by modifications to all layers of the system stack, from algorithms and software to the architecture and device layer, and even the IC digital logic synthesis itself. Unlike conventional processing that aims for worst-case performance and accuracy guarantees, error-tolerant MSP attempts to provide guarantees for the expected performance and accuracy. In this paper we review recent advances in this field from an MSP and a system (layer-by-layer) perspective, and attempt to foresee some of the components of future cross-layer error-tolerant system design that may influence the multimedia and the general computing landscape within the next ten years. © 1999-2012 IEEE

UCL Discovery

Lossy compression and real-time geovisualization for ultra-low bandwidth telemetry from untethered underwater vehicles

Author: Murphy Christopher A.
Publication venue: 'MBLWHOI Library'
Publication date: 01/01/2008
Field of study

Submitted in partial fulfillment of the requirements for the degree of Master of Science at the Massachusetts Institute of Technology and the Woods Hole Oceanographic Institution September 2008Oceanographic applications of robotics are as varied as the undersea environment itself. As underwater robotics moves toward the study of dynamic processes with multiple vehicles, there is an increasing need to distill large volumes of data from underwater vehicles and deliver it quickly to human operators. While tethered robots are able to communicate data to surface observers instantly, communicating discoveries is more difficult for untethered vehicles. The ocean imposes severe limitations on wireless communications; light is quickly absorbed by seawater, and tradeoffs between frequency, bitrate and environmental effects result in data rates for acoustic modems that are routinely as low as tens of bits per second. These data rates usually limit telemetry to state and health information, to the exclusion of mission-specific science data. In this thesis, I present a system designed for communicating and presenting science telemetry from untethered underwater vehicles to surface observers. The system's goals are threefold: to aid human operators in understanding oceanographic processes, to enable human operators to play a role in adaptively responding to mission-specific data, and to accelerate mission planning from one vehicle dive to the next. The system uses standard lossy compression techniques to lower required data rates to those supported by commercially available acoustic modems (O(10)-O(100) bits per second). As part of the system, a method for compressing time-series science data based upon the Discrete Wavelet Transform (DWT) is explained, a number of low-bitrate image compression techniques are compared, and a novel user interface for reviewing transmitted telemetry is presented. Each component is motivated by science data from a variety of actual Autonomous Underwater Vehicle (AUV) missions performed in the last year.National Science Foundation Center for Subsurface Sensing and Imaging (CenSSIS ERC

DSpace@MIT

Woods Hole Open Access Server

Recommended from our members

Efficient FPGA implementation and power modelling of image and signal processing IP cores

Author: Chandrasekaran Shrutisagar
Publication venue: Brunel University School of Engineering and Design PhD Theses
Publication date: 01/01/2007
Field of study

This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.Field Programmable Gate Arrays (FPGAs) are the technology of choice in a number ofimage and signal processing application areas such as consumer electronics, instrumentation, medical data processing and avionics due to their reasonable energy consumption, high performance, security, low design-turnaround time and reconfigurability. Low power FPGA devices are also emerging as competitive solutions for mobile and thermally constrained platforms. Most computationally intensive image and signal processing algorithms also consume a lot of power leading to a number of issues including reduced mobility, reliability concerns and increased design cost among others. Power dissipation has become one of the most important challenges, particularly for FPGAs. Addressing this problem requires optimisation and awareness at all levels in the design flow. The key achievements of the work presented in this thesis are summarised here. Behavioural level optimisation strategies have been used for implementing matrix product and inner product through the use of mathematical techniques such as Distributed Arithmetic (DA) and its variations including offset binary coding, sparse factorisation and novel vector level transformations. Applications to test the impact of these algorithmic and arithmetic transformations include the fast Hadamard/Walsh transforms and Gaussian mixture models. Complete design space exploration has been performed on these cores, and where appropriate, they have been shown to clearly outperform comparable existing implementations. At the architectural level, strategies such as parallelism, pipelining and systolisation have been successfully applied for the design and optimisation of a number of cores including colour space conversion, finite Radon transform, finite ridgelet transform and circular convolution. A pioneering study into the influence of supply voltage scaling for FPGA based designs, used in conjunction with performance enhancing strategies such as parallelism and pipelining has been performed. Initial results are very promising and indicated significant potential for future research in this area. A key contribution of this work includes the development of a novel high level power macromodelling technique for design space exploration and characterisation of custom IP cores for FPGAs, called Functional Level Power Analysis and Modelling (FLPAM). FLPAM is scalable, platform independent and compares favourably with existing approaches. A hybrid, top-down design flow paradigm integrating FLPAM with commercially available design tools for systematic optimisation of IP cores has also been developed

Brunel University Research Archive

The Iray Light Transport Simulation and Rendering System

Author: Keller Alexander
Kettner Lutz
Korndörfer Johann
Raab Matthias
Seibert Daniel
van Antwerpen Dietger
Wächter Carsten
Publication venue
Publication date: 03/05/2017
Field of study

While ray tracing has become increasingly common and path tracing is well understood by now, a major challenge lies in crafting an easy-to-use and efficient system implementing these technologies. Following a purely physically-based paradigm while still allowing for artistic workflows, the Iray light transport simulation and rendering system allows for rendering complex scenes by the push of a button and thus makes accurate light transport simulation widely available. In this document we discuss the challenges and implementation choices that follow from our primary design decisions, demonstrating that such a rendering system can be made a practical, scalable, and efficient real-world application that has been adopted by various companies across many fields and is in use by many industry professionals today

arXiv.org e-Print Archive

Crossref