31 research outputs found

    Approximate and timing-speculative hardware design for high-performance and energy-efficient video processing

    Since the end of two-dimensional transistor scaling appeared on the horizon, innovative circuit design paradigms have been on the rise to go beyond well-established, ultraconservative exact computing. Many compute-intensive applications, such as video processing, exhibit intrinsic error resilience and do not require perfect accuracy in their numerical operations. Approximate computing (AxC) is emerging as a design alternative that improves performance and energy efficiency by trading this intrinsic error tolerance for algorithm and circuit efficiency. Exact computing also imposes worst-case timing on the conventional design of hardware accelerators to ensure reliability, leading to an efficiency loss. Conversely, the timing-speculative (TS) hardware design paradigm allows increasing the frequency or decreasing the voltage beyond the limits determined by static timing analysis (STA), thereby narrowing the pessimistic safety margins that conventional design methods implement to prevent hardware timing errors. Timing errors should be evaluated by accurate gate-level simulation, but a significant gap remains: how do these timing errors propagate from the underlying hardware all the way up to the behavior of the entire algorithm, where they may degrade the performance and quality of service of the application at stake? This thesis tackles this issue by developing and demonstrating a cross-layer framework capable of investigating both AxC techniques (approximate arithmetic operators, approximate synthesis, and gate-level pruning) and TS hardware design (voltage over-scaling, frequency over-clocking, temperature rise, and device aging). The framework simulates both timing errors and logic errors at the gate level, crossing them dynamically and linking the hardware result to the algorithm level, and vice versa, as the application runs. Existing frameworks investigate AxC and TS techniques at the circuit level (i.e., at the output of the accelerator), agnostic to the ultimate impact at the application level (where the impact truly manifests), leading to suboptimal designs. Unlike the state of the art, the proposed framework offers a holistic approach to assessing the trade-offs of AxC and TS techniques at the application level. It maximizes energy efficiency and performance by identifying the maximum approximation levels that still fulfill the required good-enough quality at the application level. The thesis evaluates the framework with an 8-way SAD (Sum of Absolute Differences) hardware accelerator operating inside an HEVC encoder as a case study. Application-level results show that SAD units based on approximate adders achieve savings of up to 45% in energy per operation with an increase of only 1.9% in BD-BR. Voltage over-scaling (VOS) applied to the SAD, on the other hand, yields savings of up to 16.5% in energy per operation with an increase of around 6% in BD-BR. The framework also reveals that the boost of about 6.96% (at 50°C) to 17.41% (at 75°C with 10-year aging) in the maximum clock frequency achieved with TS hardware design is entirely lost to processing overheads of 8.06% to 46.96% when an unreliable block matching algorithm (BMA) is chosen; we also show that this overhead can be avoided by adopting a reliable BMA.
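As a rough illustration of the kind of design point the framework sweeps, the sketch below pairs an 8-pixel SAD kernel with a lower-part-OR adder (LOA), one well-known approximate adder. The thesis does not identify its specific adder designs, so `loa_add`, `sad8_approx`, and the approximation width `k` are illustrative assumptions, not the accelerator's actual RTL.

```c
#include <stdint.h>
#include <stdio.h>

/* Lower-part OR adder (LOA): the low k bits are OR-ed (no carry chain),
 * the high bits add exactly. A representative approximate adder only;
 * the thesis's evaluated adder designs are not reproduced here. */
static uint32_t loa_add(uint32_t a, uint32_t b, int k) {
    uint32_t mask = (1u << k) - 1u;
    uint32_t low  = (a | b) & mask;            /* approximate low part */
    uint32_t high = (a & ~mask) + (b & ~mask); /* exact high part      */
    return high | low;
}

/* SAD over an 8-pixel block, accumulating with the approximate adder. */
static uint32_t sad8_approx(const uint8_t *cur, const uint8_t *ref, int k) {
    uint32_t sad = 0;
    for (int i = 0; i < 8; i++) {
        uint32_t d = (cur[i] > ref[i]) ? (uint32_t)(cur[i] - ref[i])
                                       : (uint32_t)(ref[i] - cur[i]);
        sad = loa_add(sad, d, k);              /* approximate accumulation */
    }
    return sad;
}

int main(void) {
    uint8_t cur[8] = {12, 34, 56, 78, 90, 11, 22, 33};
    uint8_t ref[8] = {10, 30, 60, 70, 95, 15, 20, 40};
    printf("approx SAD (k=4): %u\n", sad8_approx(cur, ref, 4));
    return 0;
}
```

A larger `k` shortens the carry chain at the cost of more accumulation error; this is exactly the kind of knob a cross-layer framework would sweep against application-level BD-BR rather than circuit-level error alone.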
This thesis also presents approximate DTT (Discrete Tchebichef Transform) hardware designs that explore transform matrix approximation, truncation, and pruning. The results show that the approximate DTT hardware increases the maximum frequency by up to 64%, reduces circuit area by up to 43.6%, and saves up to 65.4% in power dissipation. Mapped to an FPGA, the DTT proposal shows an increase of up to 58.9% in maximum frequency and savings of about 28.7% in slices and 32.2% in dynamic power, respectively, compared with the state of the art.
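The sketch below shows how pruning and truncation reshape a 1-D 8-point transform: pruned output rows are never computed, and truncated products drop low-order bits. The ±1 matrix is a Hadamard-like stand-in with the same structure, not the actual integer Tchebichef coefficients, and `transform_pruned`, `n_keep`, and `t` are illustrative names.

```c
#include <stdio.h>

#define N 8

/* Stand-in +/-1 basis (Sylvester-Hadamard); a placeholder for the
 * integer-approximated 8x8 Tchebichef matrix used in the thesis. */
static const int T8[N][N] = {
    { 1, 1, 1, 1, 1, 1, 1, 1},
    { 1,-1, 1,-1, 1,-1, 1,-1},
    { 1, 1,-1,-1, 1, 1,-1,-1},
    { 1,-1,-1, 1, 1,-1,-1, 1},
    { 1, 1, 1, 1,-1,-1,-1,-1},
    { 1,-1, 1,-1,-1, 1,-1, 1},
    { 1, 1,-1,-1,-1,-1, 1, 1},
    { 1,-1,-1, 1,-1, 1, 1,-1},
};

static void transform_pruned(const int x[N], int y[N], int n_keep, int t) {
    for (int k = 0; k < N; k++) {
        if (k >= n_keep) { y[k] = 0; continue; } /* pruning: row hardware removed */
        int acc = 0;
        for (int i = 0; i < N; i++)
            acc += T8[k][i] * x[i];
        y[k] = acc >> t;                         /* truncation: drop t low bits */
    }
}

int main(void) {
    int x[N] = {52, 55, 61, 66, 70, 61, 64, 73};
    int y[N];
    transform_pruned(x, y, 4, 2);  /* keep 4 of 8 coefficients, drop 2 bits */
    for (int k = 0; k < N; k++) printf("%d ", y[k]);
    printf("\n");
    return 0;
}
```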

    Optimisation énergétique de processus de traitement du signal et ses applications au décodage vidéo (Energy optimization of signal processing and its applications to video decoding)

    Consumer electronics today offer more and more features (video, audio, GPS, Internet) and connectivity options (multi-radio systems with WiFi, Bluetooth, UMTS, HSPA, LTE-Advanced, etc.). The power demand of these devices is therefore growing for the digital part, and for the processing chip in particular. To support this ever-increasing computing demand, processor architectures have evolved toward multicore processors, graphics processors (GPUs), and other dedicated hardware accelerators. The evolution of battery technology, however, is much slower, so the autonomy of embedded systems is now under great pressure. Among the new functionalities supported by mobile devices, video services take a prominent place. Indeed, recent analyses show that they will represent 70% of mobile Internet traffic by 2016. Accompanying this growth, new technologies are emerging to enable new services and applications. Among them, HEVC (High Efficiency Video Coding) can double the data compression while maintaining a subjective quality equivalent to its predecessor, the H.264 standard. In a digital circuit, the total power consumption consists of static power and dynamic power. Most modern hardware architectures implement mechanisms to control the power consumption of the system. Dynamic Voltage and Frequency Scaling (DVFS) mainly reduces the dynamic power of the circuit; this technique adapts the speed of the processor (and therefore its consumption) to the actual load needed by the application. To control the static power, Dynamic Power Management (DPM, or sleep modes) shuts down the supply voltages of specific areas of the chip. In this thesis, we first present a model of the energy consumed by the circuit that integrates the DPM and DVFS modes. This model is generalized to multicore integrated circuits and integrated into a rapid prototyping tool, so that the optimal operating point of a circuit, i.e., the operating frequency and the number of active cores, can be identified. Secondly, the HEVC application is integrated onto a multicore architecture coupled with a sophisticated DVFS mechanism. We show that this application can be implemented efficiently on general-purpose processors (GPPs) while minimizing the power consumption. Finally, to obtain further energy gains, we propose a modified HEVC decoder that can trade decoding quality for additional energy savings according to the locally available energy budget.
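A toy version of such an energy model is sketched below: it sweeps candidate (frequency, active-core-count) pairs, discards those that miss the deadline, and keeps the minimum-energy point. The V-f law, the capacitance and leakage constants, and the assumption of ideal parallel scaling are illustrative placeholders, not the thesis's fitted model.

```c
#include <stdio.h>

/* Toy energy model in the spirit of the thesis: E = (Pdyn + Pstat) * t,
 * with DVFS choosing the (V, f) pair and DPM removing all cost once the
 * workload finishes. CAP, K_LEAK, and volt() are assumed constants. */

static double volt(double f_ghz) { return 0.6 + 0.4 * f_ghz; } /* assumed V-f law */

static double energy(double w_gcycles, int n, double f_ghz) {
    const double CAP = 1.0, K_LEAK = 0.3;    /* illustrative constants       */
    double t     = w_gcycles / (n * f_ghz);  /* runtime, ideal parallelism   */
    double v     = volt(f_ghz);
    double pdyn  = n * CAP * v * v * f_ghz;  /* ~ C*V^2*f per active core    */
    double pstat = n * K_LEAK * v;           /* leakage while cores are on   */
    return (pdyn + pstat) * t;               /* DPM: zero cost after finish  */
}

int main(void) {
    double w = 10.0, deadline = 5.0;         /* workload (Gcycles), deadline (s) */
    double best_e = 1e300, best_f = 0.0;
    int best_n = 0;
    for (int n = 1; n <= 8; n++)
        for (double f = 0.2; f <= 2.005; f += 0.1) {
            if (w / (n * f) > deadline) continue;  /* misses the deadline */
            double e = energy(w, n, f);
            if (e < best_e) { best_e = e; best_f = f; best_n = n; }
        }
    printf("optimal point: %d cores at %.1f GHz, E = %.3f J\n",
           best_n, best_f, best_e);
    return 0;
}
```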

    High Performance Multiview Video Coding

    Following the standardization of the latest video coding standard, High Efficiency Video Coding (HEVC), in 2013, its multiview extension (MV-HEVC) was published in 2014, bringing significantly better compression performance, around 50% for multiview and 3D videos, compared to multiple independent single-view HEVC encodings. However, the extremely high computational complexity of MV-HEVC demands significant optimization of the encoder. To tackle this problem, this work investigates the use of modern parallel computing platforms and tools, such as single-instruction-multiple-data (SIMD) instructions, multi-core CPUs, massively parallel GPUs, and computer clusters, to significantly enhance the performance of the multiview video coding (MVC) encoder. These computing tools have very different characteristics, and misusing them may result in poor performance improvement or even degradation. To achieve the best possible encoding performance from modern computing tools, different levels of parallelism inside a typical MVC encoder are identified and analyzed, and novel optimization techniques at various levels of abstraction are proposed: non-aggregation massively parallel motion estimation (ME) and disparity estimation (DE) at the prediction unit (PU) level; fractional and bi-directional ME/DE acceleration through SIMD; quantization parameter (QP)-based early termination at the coding tree unit (CTU) level; optimized, resource-scheduled wavefront parallel processing for CTUs; and workload-balanced, cluster-based multi-view parallel encoding. The results show that the proposed parallel optimization techniques significantly improve execution time performance with insignificant loss in coding efficiency. This, in turn, demonstrates that modern parallel computing platforms, with appropriate platform-specific algorithm design, are valuable tools for improving the performance of computationally intensive applications.
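To make the wavefront parallelism concrete, the sketch below computes the earliest time step at which each CTU can start under the HEVC WPP dependency pattern (left neighbour and above-right neighbour done first). It shows only the basic dependency wavefront; the resource-aware scheduling proposed in this work layers on top of it, and the grid dimensions here are arbitrary.

```c
#include <stdio.h>

#define ROWS 4
#define COLS 6

/* Earliest step at which CTU (r,c) can be processed under WPP:
 * it waits for its left neighbour (r,c-1) and its above-right
 * neighbour (r-1,c+1), clamped at the row end. The output shows
 * each row starting two steps after the row above (the wavefront). */
int main(void) {
    int step[ROWS][COLS];
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++) {
            int ready = 0;
            if (c > 0 && step[r][c - 1] + 1 > ready)
                ready = step[r][c - 1] + 1;        /* left dependency        */
            if (r > 0) {
                int ar = (c + 1 < COLS) ? c + 1 : COLS - 1;
                if (step[r - 1][ar] + 1 > ready)
                    ready = step[r - 1][ar] + 1;   /* above-right dependency */
            }
            step[r][c] = ready;
        }
    for (int r = 0; r < ROWS; r++) {
        for (int c = 0; c < COLS; c++)
            printf("%3d", step[r][c]);
        printf("\n");
    }
    return 0;
}
```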

    Algorithms for compression of high dynamic range images and video

    The recent advances in sensor and display technologies have brought about High Dynamic Range (HDR) imaging capability. Modern multiple-exposure HDR sensors can achieve a dynamic range of 100-120 dB, and LED and OLED display devices have contrast ratios of 10^5:1 to 10^6:1. Despite these advances, image/video compression algorithms and the associated hardware are still based on Standard Dynamic Range (SDR) technology, i.e., they operate within an effective dynamic range of up to 70 dB for 8-bit gamma-corrected images. Furthermore, the existing infrastructure for content distribution is also designed for SDR, which creates interoperability problems with true HDR capture and display equipment. Current solutions tone map the HDR content to fit SDR; however, this approach degrades image quality when strong dynamic range compression is applied. Although some HDR-only solutions have been proposed in the literature, they are not interoperable with the current SDR infrastructure and are thus typically used in closed systems. Given the above observations, a research gap was identified: the need for efficient compression algorithms for still images and video that can store the full dynamic range and colour gamut of HDR images while remaining backward compatible with the existing SDR infrastructure. To improve the usability of the SDR content, any such algorithm should accommodate different tone mapping operators, including those that are spatially non-uniform. In the course of the research presented in this thesis, a novel two-layer CODEC architecture is introduced for both HDR image and video coding. Furthermore, a universal and computationally efficient approximation of the tone mapping operator is developed and presented. It is shown that using perceptually uniform colourspaces for the internal representation of pixel data improves the compression efficiency of the algorithms. The proposed novel approaches to compressing the tone mapping operator's metadata are shown to improve compression performance for low-bitrate video content. Multiple compression algorithms are designed, implemented, and compared, and quality-complexity trade-offs are identified. Finally, practical aspects of implementing the developed algorithms are explored by automating the design space exploration flow and integrating the high-level systems design framework with domain-specific tools for the synthesis and simulation of multiprocessor systems. Directions for further work are also presented.
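A minimal sketch of the two-layer idea, under assumed operators: a global logarithmic TMO produces an 8-bit SDR base layer, and a per-pixel ratio against the inverse-tone-mapped base forms the enhancement residual that restores HDR. The `tmo`/`inv_tmo` pair and the ratio-style residual are illustrative choices, not the thesis's actual tone mapping approximation.

```c
#include <math.h>
#include <stdio.h>

/* Assumed global log TMO mapping luminance [0, 10000] cd/m^2 to [0, 1],
 * and its exact analytic inverse. */
static float tmo(float hdr)     { return log10f(1.0f + hdr) / log10f(1.0f + 10000.0f); }
static float inv_tmo(float sdr) { return powf(1.0f + 10000.0f, sdr) - 1.0f; }

int main(void) {
    float hdr[4] = {0.05f, 1.0f, 120.0f, 9500.0f};  /* scene luminances, cd/m^2 */
    for (int i = 0; i < 4; i++) {
        /* Base layer: tone-mapped, quantized to 8 bits (SDR-decodable). */
        float base  = roundf(tmo(hdr[i]) * 255.0f) / 255.0f;
        /* Enhancement layer: ratio residual against the inverse-mapped base. */
        float ratio = hdr[i] / inv_tmo(base);
        /* HDR reconstruction from the two layers. */
        float rec   = inv_tmo(base) * ratio;
        printf("hdr=%9.2f base=%.3f ratio=%.4f rec=%9.2f\n",
               hdr[i], base, ratio, rec);
    }
    return 0;
}
```

Because the base layer is quantized, the ratio deviates from 1.0; that deviation is exactly the information the enhancement layer must carry, which is why its compression efficiency dominates the overall bitrate.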

    Resource-Constrained Low-Complexity Video Coding for Wireless Transmission


    Investigation of a Novel Common Subexpression Elimination Method for Low Power and Area Efficient DCT Architecture

    Wide interest has been observed in finding low-power and area-efficient hardware designs for the discrete cosine transform (DCT) algorithm. This work proposes a novel Common Subexpression Elimination (CSE) based pipelined architecture for the DCT, aimed at reducing the cost metrics of power and area while maintaining high speed and accuracy in DCT applications. The proposed design combines Canonical Signed Digit (CSD) representation and CSE to implement multiplier-less fixed-constant multiplication of the DCT coefficients. Furthermore, symmetry in the DCT coefficient matrix is exploited together with CSE to further decrease the number of arithmetic operations. The architecture needs only a single-port memory to feed the inputs instead of a multiport memory, which reduces the hardware cost and area. Analysis of the experimental results and performance comparisons shows that the proposed scheme uses minimal logic, a mere 340 slices and 22 adders. Moreover, the design meets the real-time constraints of different video/image coders and peak signal-to-noise ratio (PSNR) requirements. Compared with recent well-known methods, the proposed technique maintains accuracy while improving power reduction, silicon area usage, and maximum operating frequency by 41%, 15%, and 15%, respectively.
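The core CSD-plus-CSE trick can be shown in a few lines: CSD recoding replaces runs of ones with a signed digit to save adders, and CSE computes a repeated bit pattern once and reuses it. The constants 23 and 45 below are chosen for clarity; they are not actual DCT coefficients.

```c
#include <stdint.h>
#include <stdio.h>

/* 23 = 10111b needs 3 adds in plain binary; its CSD form
 * 23 = 32 - 8 - 1 needs only 2 add/subtract operations. */
static int32_t mul23_csd(int32_t x) {
    return (x << 5) - (x << 3) - x;
}

/* 45 = 101101b: the pattern 101 (= 5) occurs twice, so CSE computes
 * t = 5x once and reuses it: 45x = (5x << 3) + 5x, two adds total
 * instead of three. */
static int32_t mul45_cse(int32_t x) {
    int32_t t = (x << 2) + x;   /* shared subexpression: 5x */
    return (t << 3) + t;
}

int main(void) {
    printf("23*7 = %d\n", mul23_csd(7));  /* 161 */
    printf("45*7 = %d\n", mul45_cse(7));  /* 315 */
    return 0;
}
```

In hardware, each add/subtract saved this way is an adder removed from the datapath, which is where the reported power and area reductions come from.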

    Receiver-Driven Video Adaptation

    In the span of a single generation, video technology has made an incredible impact on daily life. Modern use cases for video are wildly diverse, including teleconferencing, live streaming, virtual reality, home entertainment, social networking, surveillance, body cameras, cloud gaming, and autonomous driving. As these applications continue to grow more sophisticated and heterogeneous, a single representation of video data can no longer satisfy all receivers. Instead, the initial encoding must be adapted to each receiver's unique needs. Existing adaptation strategies are fundamentally flawed, however, because they discard the video's initial representation and force the content to be re-encoded from scratch. This process is computationally expensive, does not scale well with the number of videos produced, and throws away important information embedded in the initial encoding. Therefore, a compelling need exists for the development of new strategies that can adapt video content without fully re-encoding it. To better support the unique needs of smart receivers, diverse displays, and advanced applications, general-use video systems should produce and offer receivers a more flexible compressed representation that supports top-down adaptation strategies from an original, compressed-domain ground truth. This dissertation proposes an alternate model for video adaptation that addresses these challenges. The key idea is to treat the initial compressed representation of a video as the ground truth, and allow receivers to drive adaptation by dynamically selecting which subsets of the captured data to receive. In support of this model, three strategies for top-down, receiver-driven adaptation are proposed. First, a novel, content-agnostic entropy coding technique is implemented in which symbols are selectively dropped from an input abstract symbol stream based on their estimated probability distributions to hit a target bit rate. Receivers are able to guide the symbol dropping process by supplying the encoder with an appropriate rate controller algorithm that fits their application needs and available bandwidths. Next, a domain-specific adaptation strategy is implemented for H.265/HEVC coded video in which the prediction data from the original source is reused directly in the adapted stream, but the residual data is recomputed as directed by the receiver. By tracking the changes made to the residual, the encoder can compensate for decoder drift to achieve near-optimal rate-distortion performance. Finally, a fully receiver-driven strategy is proposed in which the syntax elements of a pre-coded video are cataloged and exposed directly to clients through an HTTP API. Instead of requesting the entire stream at once, clients identify the exact syntax elements they wish to receive using a carefully designed query language. Although an implementation of this concept is not provided, an initial analysis shows that such a system could save bandwidth and computation when used by certain targeted applications.
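A minimal sketch of the first strategy, under assumed data structures: each symbol's estimated cost is its self-information, -log2(p), and a greedy controller drops the costliest (least probable) symbols until the stream fits the budget. The greedy loop stands in for the receiver-supplied rate controller described in the dissertation; the probabilities and budget are made up for the example.

```c
#include <math.h>
#include <stdio.h>

/* An abstract symbol with its estimated probability and drop flag. */
typedef struct { int id; double prob; int dropped; } Symbol;

int main(void) {
    Symbol s[6] = {
        {0, 0.40, 0}, {1, 0.25, 0}, {2, 0.15, 0},
        {3, 0.10, 0}, {4, 0.06, 0}, {5, 0.04, 0},
    };
    double budget_bits = 8.0, total = 0.0;
    for (int i = 0; i < 6; i++)
        total += -log2(s[i].prob);          /* estimated cost per symbol */

    /* Greedy stand-in rate controller: drop the costliest symbol
     * (lowest probability) until the estimate fits the budget. */
    while (total > budget_bits) {
        int worst = -1;
        for (int i = 0; i < 6; i++)
            if (!s[i].dropped && (worst < 0 || s[i].prob < s[worst].prob))
                worst = i;
        if (worst < 0) break;               /* nothing left to drop */
        s[worst].dropped = 1;
        total -= -log2(s[worst].prob);
    }
    for (int i = 0; i < 6; i++)
        printf("symbol %d (p=%.2f): %s\n", s[i].id, s[i].prob,
               s[i].dropped ? "dropped" : "kept");
    printf("estimated size: %.2f bits (budget %.1f)\n", total, budget_bits);
    return 0;
}
```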

    Autonomous Recovery Of Reconfigurable Logic Devices Using Priority Escalation Of Slack

    Field Programmable Gate Array (FPGA) devices offer a suitable platform for survivable hardware architectures in mission-critical systems. In this dissertation, active, dynamic redundancy-based fault-handling techniques are proposed which exploit the dynamic partial reconfiguration capability of SRAM-based FPGAs. Self-adaptation is realized by employing reconfiguration in the detection, diagnosis, and recovery phases. To extend these concepts to semiconductor aging and process variation in the deep submicron era, resilient adaptable processing systems are sought to maintain quality and throughput requirements despite the vulnerabilities of the underlying computational devices. A new approach to autonomous fault-handling which addresses these goals is developed using only a uniplex hardware arrangement. It operates by observing a health metric to achieve Fault Demotion using Reconfigurable Slack (FaDReS). Here, an autonomous fault isolation scheme is employed which neither requires test vectors nor suspends the computational throughput, but instead observes the value of a health metric based on runtime input. The deterministic flow of the fault isolation scheme guarantees success in a bounded number of reconfigurations of the FPGA fabric. FaDReS is then extended to the Priority Using Resource Escalation (PURE) online redundancy scheme, which considers fault-isolation latency and throughput trade-offs under a dynamic spare arrangement. While deep-submicron designs introduce new challenges, the use of adaptive techniques is seen to provide several promising avenues for improving resilience. The schemes developed are demonstrated by hardware designs of various signal processing circuits and their implementation on a Xilinx Virtex-4 FPGA device. These include Discrete Cosine Transform (DCT), Motion Estimation (ME), Finite Impulse Response (FIR) filter, Support Vector Machine (SVM), and Advanced Encryption Standard (AES) blocks, in addition to MCNC benchmark circuits. A significant reduction in power consumption is achieved, ranging from 83% for low-motion-activity scenes to 12.5% for high-motion-activity video scenes in a novel ME engine configuration. For a typical benchmark video sequence, PURE is shown to maintain a PSNR baseline near 32 dB. The diagnosability, reconfiguration latency, and resource overhead of each approach are analyzed. Compared to previous alternatives, PURE maintains a PSNR within 4.02 dB to 6.67 dB of the fault-free baseline by escalating healthy resources to higher-priority signal processing functions. The results indicate the benefits of priority-aware resiliency over conventional redundancy approaches in terms of fault recovery, power consumption, and resource-area requirements. Together, these provide a broad range of strategies to achieve autonomous recovery of reconfigurable logic devices under a variety of constraints, operating conditions, and optimization criteria.
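As a simplified illustration of health-metric-driven fault isolation in a bounded number of reconfigurations, the sketch below locates a faulty processing element by relocating half of the candidates onto known-good fabric and observing whether the health metric recovers, finishing in at most ceil(log2 N) steps. The binary-search framing and the `probe_after_relocating` helper are illustrative simplifications, not the exact FaDReS/PURE procedure; on hardware the probe would be an actual partial reconfiguration followed by a health-metric reading.

```c
#include <stdio.h>

#define N 8
static int faulty_pe = 5;  /* hidden fault location (simulation only) */

/* Returns 1 if the health metric recovers after the PEs in [lo, hi)
 * are relocated to known-good spare fabric, i.e. the faulty fabric
 * region lies inside [lo, hi). Stands in for a PR-plus-probe step. */
static int probe_after_relocating(int lo, int hi) {
    return faulty_pe >= lo && faulty_pe < hi;
}

int main(void) {
    int lo = 0, hi = N, reconfigs = 0;
    while (hi - lo > 1) {
        int mid = (lo + hi) / 2;
        reconfigs++;                          /* one PR operation per probe */
        if (probe_after_relocating(lo, mid))
            hi = mid;                         /* health recovered: fault in [lo, mid) */
        else
            lo = mid;                         /* fault in [mid, hi) */
    }
    printf("fault isolated at PE %d in %d reconfigurations\n", lo, reconfigs);
    return 0;
}
```

Throughput is never suspended in this style of scheme: the relocated modules keep computing on spare resources while each probe runs, which is the property the abstract emphasizes over test-vector-based diagnosis.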