9 research outputs found

Optimizing HEVC CABAC decoding with a context model cache and application-specific prefetching

Author: Chi Chi Ching
Habermann Philipp
Juurlink Ben
Álvarez-Mesa Mauricio
Publication date: 01/01/2015

Context-based Adaptive Binary Arithmetic Coding (CABAC) is the entropy coding module in the most recent JCT-VC video coding standard, HEVC/H.265. As in the predecessor H.264/AVC, CABAC is a well-known throughput bottleneck due to its strong data dependencies. Besides other optimizations, the replacement of the context model memory by a smaller cache has been proposed, resulting in an improved clock frequency. However, the effect of potential cache misses has not been properly evaluated. Our work fills this gap and performs an extensive evaluation of different cache configurations. Furthermore, it is demonstrated that application-specific context model prefetching can effectively reduce the miss rate and make it negligible. The best overall performance was achieved with caches of two and four lines, where each cache line consists of four context models. Four cache lines allow a speed-up of 10% to 12% for all video configurations, while two cache lines improve the throughput by 9% to 15% for high bitrate videos and by 1% to 4% for low bitrate videos.

Funding: EC/H2020/645500/EU/Improving European VoD Creative Industry with High Efficiency Video Delivery/Film26

DepositOnce

Application-Specific Cache and Prefetching for HEVC CABAC Decoding

Author: Chi Chi Ching
Habermann Philipp
Juurlink Ben
Álvarez-Mesa Mauricio
Publication date: 01/01/2017

Context-based Adaptive Binary Arithmetic Coding (CABAC) is the entropy coding module in the HEVC/H.265 video coding standard. As in its predecessor, H.264/AVC, CABAC is a well-known throughput bottleneck due to its strong data dependencies. Besides other optimizations, the replacement of the context model memory by a smaller cache has been proposed for hardware decoders, resulting in an improved clock frequency. However, the effect of potential cache misses has not been properly evaluated. This work fills the gap by performing an extensive evaluation of different cache configurations. Furthermore, it demonstrates that application-specific context model prefetching can effectively reduce the miss rate and increase the overall performance. The best results are achieved with two cache lines consisting of four or eight context models. The 2 × 8 cache allows a performance improvement of 13.2 percent to 16.7 percent compared to a non-cached decoder due to a 17 percent higher clock frequency and highly effective prefetching. The proposed HEVC/H.265 CABAC decoder allows the decoding of high-quality Full HD videos in real-time using few hardware resources on a low-power FPGA.

Funding: EC/H2020/645500/EU/Improving European VoD Creative Industry with High Efficiency Video Delivery/Film26
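
To make the idea of a small context model cache with prefetching concrete, here is a minimal Python sketch. It is not the authors' hardware design: the fully associative LRU policy, the function name `simulate_cache`, and the `prefetch` hint callback are all assumptions chosen for illustration. The sketch only counts how often an access falls outside the cached lines, with and without a hint about which context model is expected next.

```python
from collections import OrderedDict


def simulate_cache(trace, num_lines=2, line_size=8, prefetch=None):
    """Return the miss rate of a tiny LRU cache of context model lines.

    trace     : sequence of context model indices in access order
    num_lines : number of cache lines (assumed fully associative, LRU)
    line_size : context models per cache line
    prefetch  : optional hypothetical hint, ctx_index -> ctx_index expected
                next; the line holding that index is loaded after each access
    """
    cache = OrderedDict()  # line tag -> None, ordered from LRU to MRU
    misses = 0

    def touch(tag):
        # Insert or refresh a line, then evict LRU lines beyond capacity.
        cache[tag] = None
        cache.move_to_end(tag)
        while len(cache) > num_lines:
            cache.popitem(last=False)

    for ctx in trace:
        tag = ctx // line_size
        if tag not in cache:
            misses += 1
        touch(tag)
        if prefetch is not None:
            touch(prefetch(ctx) // line_size)

    return misses / len(trace) if trace else 0.0


if __name__ == "__main__":
    # Synthetic trace that cycles through four lines, which thrashes a 2-line cache.
    trace = [0, 8, 16, 24] * 2
    print(simulate_cache(trace))
    print(simulate_cache(trace, prefetch=lambda c: (c + 8) % 32))
```

In this toy trace every access misses without a hint, while a hint that always names the next line leaves only the initial cold miss. Real traces and the prefetching rules evaluated in the paper are more involved.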

DepositOnce

Parallel HEVC Decoding on Multi- and Many-core Architectures : A Power and Performance Analysis

Author: Chi Chi Ching
Juurlink Ben
Lucas Jan
Schierl Thomas
Álvarez-Mesa Mauricio
Publication date: 01/01/2013

The Joint Collaborative Team on Video Coding is developing a new standard named High Efficiency Video Coding (HEVC) that aims at reducing the bitrate of H.264/AVC by another 50%. In order to fulfill the computational demands of the new standard, in particular for high resolutions and at low power budgets, exploiting parallelism is no longer an option but a requirement. Therefore, HEVC includes several coding tools that allow each picture to be divided into several partitions that can be processed in parallel without degrading either the quality or the bitrate. In this paper we adapt one of these approaches, Wavefront Parallel Processing (WPP), and show how it can be implemented on multi- and many-core processors. Our approach, named Overlapped Wavefront (OWF), processes several partitions as well as several pictures in parallel. This has the advantage that the amount of (thread-level) parallelism stays constant during execution. In addition, performance and power results are provided for three platforms: a server Intel CPU with 8 cores, a laptop Intel CPU with 4 cores, and a TILE-Gx36 with 36 cores from Tilera. The results show that our parallel HEVC decoder is capable of achieving an average frame rate of 116 fps for 4k resolution on a standard multicore CPU. The results also demonstrate that exploiting more parallelism by increasing the number of cores can substantially improve the energy efficiency, measured in Joules per frame.
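
The wavefront dependency pattern can be illustrated with a short Python sketch. It rests on simplifying assumptions that are not taken from the paper: every coding tree unit (CTU) takes one time step, there are unlimited threads, and each CTU waits only for its left neighbour and for the CTU above and to the right. The function names are hypothetical.

```python
def wpp_schedule(rows, cols):
    """Earliest unit-time step at which each CTU can be decoded under WPP."""
    step = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            deps = []
            if c > 0:
                deps.append(step[r][c - 1])  # left neighbour in the same row
            if r > 0:
                # Top-right neighbour; at the right edge fall back to the
                # last CTU of the row above.
                deps.append(step[r - 1][min(c + 1, cols - 1)])
            step[r][c] = max(deps) + 1 if deps else 0
    return step


def parallelism_profile(step):
    """Number of CTUs decoded in each time step of the schedule."""
    last = max(max(row) for row in step)
    profile = [0] * (last + 1)
    for row in step:
        for s in row:
            profile[s] += 1
    return profile


if __name__ == "__main__":
    schedule = wpp_schedule(rows=3, cols=4)
    for row in schedule:
        print(row)
    print(parallelism_profile(schedule))
```

In this model the number of CTUs in flight ramps up at the start of a picture and ramps down at the end. That is the inefficiency that overlapping consecutive pictures, as OWF does, is meant to hide.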

DepositOnce

Video Coding Performance

Author: Shahriar Akramullah
Publication venue: Apress
Publication date: 01/01/2014

Springer - Publisher Connector

Circuit implementations for high-efficiency video coding tools

Author: Tikekar Mehul (Mehul Deepak)
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2012

Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 71-72).

High-Efficiency Video Coding (HEVC) is planned to be the successor video standard to the popular Advanced Video Coding (H.264/AVC) with a targeted 2x improvement in compression at the same quality. This improvement comes at the cost of increased complexity through the addition of new coding tools and increased computation in existing tools. The ever-increasing demand for higher resolution video further adds to the computation cost. In this work, digital circuits for two HEVC tools, the inverse transform and the deblocking filter, are implemented to support Quad-Full HD (4K x 2K) video decoding at 30 fps. Techniques to reduce power and area cost are investigated, and synthesis results in 40 nm CMOS technology and on a Virtex-6 FPGA platform are presented.

by Mehul Tikekar. S.M.

DSpace@MIT

Performance, Power, and Quality Tradeoff Analysis

Author: Shahriar Akramullah
Publication venue: Springer Science and Business Media LLC

Crossref

Springer - Publisher Connector

Towards Computational Efficiency of Next Generation Multimedia Systems

Author: Khan Muhammad Usman Karim
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2015

To address the throughput demands of complex applications such as multimedia, a next-generation system designer needs to co-design and co-optimize the hardware and software layers. Hardware and software knobs must be tuned together to increase throughput efficiency. This thesis provides such algorithmic and architectural solutions while considering new technology challenges (power caps and memory aging). The goal is to maximize throughput efficiency under timing and hardware constraints.

KITopen

Architectures for Adaptive Low-Power Embedded Multimedia Systems

Author: Shafique Muhammad
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2011

This Ph.D. thesis describes novel hardware/software architectures for adaptive low-power embedded multimedia systems. Novel techniques for run-time adaptive energy management are proposed, such that hardware and software adapt together to react to unpredictable scenarios. A complete power-aware H.264 video encoder was developed. Comparison with the state of the art demonstrates significant energy savings while meeting the performance constraint and keeping the video quality degradation unnoticeable.

KITopen

HEVC CABAC Dekodierung

Author: Habermann Philipp
Publication date: 09/12/2020

Video applications have emerged in various fields of our everyday life. They have continuously enhanced the user experience in entertainment and communication services. All this would not have been possible without the evolution of video compression standards and computer architectures over the last decades. Modern video codecs employ sophisticated algorithms to transform raw video data to an intermediate representation consisting of syntax elements, which allows enormous compression rates before reconstructing the video with minimal objective quality losses compared to the original video. Modern computer architectures lay the foundation for these computationally intensive tasks. They provide multiple cores and specialized vector architectures to exploit the massive amount of parallelism that can be found in video applications. Customized hardware solutions follow the same principles. Parallel processing is essential to satisfy real-time performance constraints while optimizing energy efficiency, the latter being the most important design goal for mobile devices. One of the main tasks in modern video compression standards implements a highly sequential algorithm and lacks data-level parallelism in contrast to all other compute-intensive tasks: Context-based Adaptive Binary Arithmetic Coding (CABAC). It is the entropy coding module in the state-of-the-art High Efficiency Video Coding (HEVC) standard and also its successor Versatile Video Coding. Its purpose is the compression and decompression of the intermediate video representation by exploiting statistical properties, thus achieving minimal bitrates. CABAC is one of the main throughput bottlenecks in video coding applications due to the limited parallelization opportunities, especially for high-quality videos. Close-distance control and data dependencies make CABAC even more challenging to implement with modern computer architectures. 
This thesis addresses the critical CABAC decoding throughput bottleneck by proposing multiple approaches to uncover new parallelization opportunities and to improve the performance with architectural optimizations. First of all, we quantitatively verify the severity of the CABAC decoding throughput bottleneck by evaluating the HEVC decoding performance for various workloads using a representative selection of state-of-the-art computer architectures. The results show that even the most powerful processors cannot provide real-time performance for several high-quality workloads. The profiling results clearly show that CABAC decoding is the main reason for that in most cases. Wavefront Parallel Processing (WPP) is a well-established high-level parallelization technique used in video coding and other applications. It can lead to a high degree of parallelism, but it suffers from inefficiencies due to the dependencies between consecutive rows in a frame. We present three WPP implementations for HEVC CABAC decoding with improved parallel efficiency. The WPP versions based on more fine-grained dependency checks allow speed-ups up to 1.83x at very low implementation cost. We also present a bitstream partitioning scheme for future video compression standards. It enables additional parallelism in CABAC decoding by distributing syntax elements among eight bitstream partitions. Besides the parallelization opportunities, this allows specialization of the subdecoders responsible for the processing of their corresponding partitions, as they have to process fewer types of syntax elements. This leads to further improvements in clock frequency and significant hardware savings compared to a full replication of the CABAC decoder, as is required for approaches such as WPP.
Decoding speedups up to 8.5x at the cost of only 61.9% extra hardware area and less than 0.7% bitstream overhead for typical Full High Definition videos make this technique a promising candidate for use in future video compression standards. Furthermore, a cache-based architectural optimization is presented. It replaces the context model memory, a critical component in the CABAC decoder pipeline, by a smaller cache, thus increasing the achievable clock frequency. An application-specific adaptive prefetching algorithm is used together with a context model memory layout optimized for spatial and temporal locality. We perform a design space exploration of different cache configurations, finding that a cache of 2x8 context models provides the best performance. It allows for a 17% increase in clock frequency and miss rates of less than 2%, resulting in performance improvements up to 16.7%. We also propose techniques for more efficient CABAC decoding on general-purpose processors. Frequent, hard-to-predict branches lead to very inefficient implementations on these processors. Using more complex but linear arithmetic functions for the parallel decoding of binary symbols provides a speedup of up to 2.04x. A separate bitstream partition for this type of binary symbol even allows speedups up to 2.45x at the cost of not more than 0.2% higher bitrate for typical Full High Definition videos. Finally, we provide recommendations for future video compression standards and computer architectures as well as further research ideas for video coding in general and CABAC in particular.
The research conducted in this thesis shows multiple approaches that can substantially improve the performance of CABAC decoding, thereby addressing one of the most critical throughput bottlenecks in modern video coding applications.

Video applications have become established in many areas of our daily lives, steadily improving the user experience in entertainment and communication. This would not have been possible without the continuous development of video compression standards and computer architectures. Modern video codecs use complex algorithms to transform raw video data into an intermediate representation consisting of syntax elements, which permits enormous compression rates. The video data can subsequently be reconstructed with minimal quality loss compared to the original video. Modern computer architectures lay the foundation for these compute-intensive processes. They provide numerous cores and specialized vector architectures that exploit the many parallelization opportunities in video applications. Parallel data processing is essential to guarantee real-time capability while optimizing energy efficiency, which is one of the most important design goals for mobile devices in particular. Context-based Adaptive Binary Arithmetic Coding (CABAC) is the entropy coding method in the current High Efficiency Video Coding (HEVC) standard as well as in its successor, Versatile Video Coding. CABAC is responsible for compressing and decompressing the intermediate representation of a video by exploiting statistical properties, which allows minimal bitrates to be achieved. It uses a sequential algorithm that, unlike all other compute-intensive components of current video compression standards, does not allow CABAC to exploit data-level parallelism.

Owing to the lack of parallelization opportunities, CABAC is one of the most critical components limiting the overall performance of a video decoder. This holds especially for high-quality videos with correspondingly high bitrates. In addition, the many control and data dependencies in CABAC pose major challenges for modern computer architectures. The goal of this doctoral thesis is to improve the performance of the CABAC decoder, since it significantly affects the overall performance of current video decoders. To this end, we present several approaches that create new parallelization opportunities on the one hand and enable more efficient implementations through architectural optimizations on the other. First, we verify quantitatively that CABAC is a critical component of the HEVC decoding process. To do so, we analyze the decoding performance of a representative selection of current computer systems for various typical video applications. The results show that even the most powerful processors are not real-time capable for all applications. Further investigation clearly confirms that CABAC is the main cause in most cases. We then turn to the optimization of Wavefront Parallel Processing (WPP), a widespread parallelization technique used in video coding and many other applications. WPP permits a high degree of parallelization but loses efficiency because of the dependencies between neighboring picture regions. We present three implementation variants that significantly improve the parallel efficiency of WPP for CABAC in HEVC. This is achieved through finer-grained dependency checks than in conventional WPP implementations.

Video decoding can thus be accelerated by a factor of up to 1.83x, while the implementation becomes only marginally more complex. Next, we present a bitstream partitioning scheme for future video compression standards that creates additional parallelization opportunities. This is achieved by distributing all syntax elements across eight partitions while taking their dependencies into account. It also enables significant increases in the clock frequency of a hardware decoder, because the specialized subdecoders for the different partitions have to handle far fewer types of syntax elements. The reduced complexity of the subdecoders additionally allows drastic hardware savings, especially compared to techniques such as WPP that require full replication of the CABAC decoder. The presented decoder achieves a speedup of up to 8.5x at only 61.9% additional hardware cost and a bitrate increase of at most 0.7% for typical Full HD videos. We also present a cache-based CABAC decoder. It replaces the context model memory with a smaller cache and thereby permits operation at higher clock frequencies, since the memory access affects the critical path. The resulting cache misses are effectively reduced with an optimized memory layout and an adaptive prefetching algorithm. An investigation of different cache architectures shows that a 2x8 context model cache delivers the best performance. By increasing the clock frequency by 17% with a miss rate of at most 2%, decoder throughput can be raised by up to 16.7%. The final optimization presented addresses software CABAC decoding. The algorithm contains many hard-to-predict branches in its control flow, which poses a major challenge for current processors and leads to inefficient implementations.

Using complex arithmetic instructions for parallel decoding yields a speedup of up to 2.04x. Using two bitstream partitions for different types of binary symbols even makes it possible to decode one of them without any computational effort. Consequently, an even higher speedup of up to 2.45x is possible at no more than 0.2% higher bitrate. Finally, we give recommendations for the development of future video compression standards and computer architectures. Further research ideas for video coding in general and CABAC in particular are also discussed. The research underlying this thesis already demonstrates several promising approaches that can significantly increase the performance of CABAC and thus of the entire decoder. By addressing this critical component, we make an important contribution to improving many current and future video applications.
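
The abstract does not spell out which arithmetic functions replace the branches. As an illustration only, the Python sketch below shows one well-known reformulation for bypass-coded bins: the per-bin compare-and-subtract loop is replaced by a single integer division. It uses a simplified integer model without the fixed-point scaling found in real decoders, and the function names are hypothetical, so it should not be read as the thesis's actual method.

```python
def decode_bypass_sequential(offset, rng, bits):
    """Decode bypass bins one at a time; each step has a data-dependent branch.

    offset : current arithmetic decoder offset, assumed 0 <= offset < rng
    rng    : current range
    bits   : next bits read from the bitstream, one per bin
    """
    bins = []
    for b in bits:
        offset = (offset << 1) | b
        if offset >= rng:
            bins.append(1)
            offset -= rng
        else:
            bins.append(0)
    return bins, offset


def decode_bypass_batch(offset, rng, bits):
    """Decode the same bins with one division instead of per-bin branches."""
    n = len(bits)
    value = offset
    for b in bits:
        value = (value << 1) | b  # append all n bits at once
    quotient, offset = divmod(value, rng)
    # The n-bit quotient holds the decoded bins, most significant bin first.
    bins = [(quotient >> (n - 1 - i)) & 1 for i in range(n)]
    return bins, offset


if __name__ == "__main__":
    args = (123, 300, [1, 0, 1, 1])
    print(decode_bypass_sequential(*args))
    print(decode_bypass_batch(*args))
```

Both functions return the same bins and final offset as long as the initial offset is smaller than the range. The batch form trades the unpredictable branches for a division, which matches the general trade-off the abstract describes.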

DepositOnce