10 research outputs found

    Tailored AVX2 Transform Kernels for Versatile Video Coding

    Transform coding tools play an integral part in video codecs due to their substantial impact on coding efficiency. The latest video coding standard, Versatile Video Coding (VVC), makes the most of these tools by introducing new DST7, DCT8, and non-square transforms alongside the conventional DCT2 transform. This paper proposes optimized AVX2 kernels for all these transforms to speed up VVC coding. Unlike existing solutions, our kernels are specially tailored for each VVC transform type and block size. Accelerating our open-source uvg266 VVC encoder with the proposed kernels yields up to a 1.1× speedup under the all-intra (AI) coding condition without any coding overhead. Our implementations make forward DCT2 and DST7/DCT8 transforms 4.0× and 6.7× as fast as their respective scalar implementations in the VTM reference encoder. They also outpace the AVX2 kernels of the practical VVenC encoder by factors of 3.0× and 2.8×. With inverse transforms, the respective speedups rise to as much as 5.3×, 11.1×, 3.4×, and 3.0×.

    Feasibility Study of High-Level Synthesis : Implementation of a Real-Time HEVC Intra Encoder on FPGA

    High-Level Synthesis (HLS) is an automated design process that seeks to improve productivity over traditional design methods by raising the design abstraction from register transfer level (RTL) to behavioural level. Various commercial HLS tools have been on the market since the 1990s, but only recently have they started to gain adoption across industry and academia. The slow adoption rate has mainly stemmed from a lower quality of results (QoR) than that obtained with conventional hardware description languages (HDLs). However, the latest HLS tool generations have substantially narrowed the QoR gap. This thesis studies the feasibility of HLS in video codec development. It introduces several HLS implementations for High Efficiency Video Coding (HEVC), which is the key enabling technology for numerous modern media applications. HEVC doubles the coding efficiency over its predecessor, the Advanced Video Coding (AVC) standard, for the same subjective visual quality, but typically at the cost of considerably higher computational complexity. Therefore, real-time HEVC calls for automated design methodologies that can be used to minimize the hardware (HW) implementation and verification effort. This thesis proposes to use HLS throughout the whole encoder design process, from data-intensive coding tools, like intra prediction and discrete transforms, to more control-oriented tools, such as entropy coding. The C source code of the open-source Kvazaar HEVC encoder serves as a design entry point for the HLS flow, and it is also utilized in design verification. The performance results are gathered on and reported for a field-programmable gate array (FPGA). The main contribution of this thesis is an HEVC intra encoder prototype built on a Nokia AirFrame Cloud Server equipped with dual 2.4 GHz 14-core Intel Xeon processors and two Intel Arria 10 GX FPGA Development Kits, which can be connected to the server via peripheral component interconnect express (PCIe) generation 3 or 40 Gigabit Ethernet. The proof-of-concept system achieves real-time 4K coding speed of up to 120 fps, which can be further scaled up by adding practically any number of network-connected FPGA cards. Overcoming the complexity of HEVC and customizing its rich features for a real-time HEVC encoder implementation on hardware is not a trivial task, as hardware development has traditionally turned out to be very time-consuming. This thesis shows that HLS is able to shorten development time, provide previously unseen design scalability, and still deliver competitive performance and QoR compared with state-of-the-art hardware implementations.

    Parametrien etsintä HEVC:n tehokkaalle moodivalinnalle [Parameter exploration for efficient mode selection in HEVC]

    High Efficiency Video Coding (HEVC) is the latest video coding standard. It halves the bit rate compared with the previous standard, Advanced Video Coding (AVC). However, the bit rate decrease comes with a 40% increase in encoding complexity. This is mainly due to a larger number of block coding modes, including symmetric motion partitions (SMPs), asymmetric motion partitions (AMPs), and larger coding units of up to 64x64 pixels. These new features are mainly used for inter prediction, which accounts for 60-70% of the whole encoding time. For this reason, optimization of inter prediction is the main topic of this thesis. To tackle the inter prediction complexity, parametric exploration was chosen as the approach. The exploration was done by gradually shifting the focus from the coarsest optimizations to parameter fine-tuning. The selected approach required thousands of individual tests, so an automated solution was needed. This led to the creation of a new software solution, TUT Task Manager. It is capable of automatically distributing the tasks of parametric exploration to any number of nodes available in the local network. In total, TUT Task Manager was used to run 4000 tests with a combined CPU time of 14 months. The results were used to create a set of recommended schemes for inter mode selection. Overall, these new schemes are shown to provide 31-50% complexity savings over the default configuration of HM 11.0, with a minor bit rate increase of 0.2-1.3%. They also provide better RDC performance than the existing solutions. The tools and methods used in this work are generic enough to be used to further optimize other parts of the video codec.

    Kvazaar HEVC videokooderin pakkaustehokkuuden ja suorituskyvyn optimointi [Optimization of the compression efficiency and performance of the Kvazaar HEVC video encoder]

    Growing video resolutions have led to an increasing volume of Internet video traffic, which has created a need for more efficient video compression. New video coding standards, such as High Efficiency Video Coding (HEVC), enable a higher level of compression, but the complexity of the corresponding encoder implementations is also higher. Therefore, encoders that are efficient in terms of both compression and complexity are required. In this work, we implement four optimizations to the Kvazaar HEVC encoder: 1) uniform inter and intra cost comparison; 2) concurrency-oriented SAO implementation; 3) resolution-adaptive thread allocation; and 4) fast cost estimation of coding coefficients. Optimization 1 changes the selection criterion of the prediction mode in fast configurations, which greatly improves the coding efficiency. Optimization 2 replaces the implementation of one of the in-loop filters with one that better supports concurrent processing. This allows removing some dependencies between encoding tasks, which provides more opportunities for parallel processing to increase coding speed. Optimization 3 reduces the overhead of thread management by spawning fewer threads when there is not enough work for all available threads. Optimization 4 speeds up the computation of residual coefficient coding costs by switching to a faster but less accurate estimation. The impact of the optimizations is measured with two coding configurations of Kvazaar: the ultrafast preset, which aims for the fastest coding speed, and the veryslow preset, which aims for the best coding efficiency. Together, the introduced optimizations give a 2.8× speedup in the ultrafast configuration and a 3.4× speedup in the veryslow configuration. The trade-off for the speedup with the veryslow preset is a 0.15% bit rate increase. With the ultrafast preset, however, the optimizations also improve coding efficiency by 14.39%.

    Resource management for power-constrained HEVC transcoding using reinforcement learning

    The advent of online video streaming applications and services, along with users' demand for high-quality content, requires High Efficiency Video Coding (HEVC), which provides higher video quality and more compression at the cost of increased complexity. On one hand, HEVC exposes a set of dynamically tunable parameters to provide trade-offs among Quality-of-Service (QoS), performance, and power consumption of multi-core servers in the video providers' data centers. On the other hand, resource management of modern multi-core servers is in charge of adapting system-level parameters, such as operating frequency and multithreading, to deal with concurrent applications and their requirements. Therefore, efficient multi-user HEVC streaming necessitates joint adaptation of application- and system-level parameters. Nonetheless, dealing with such a large and dynamic design space is challenging and difficult to address through conventional resource management strategies. Thus, in this work, we develop a multi-agent Reinforcement Learning framework to jointly adjust application- and system-level parameters at runtime to satisfy the QoS of multi-user HEVC streaming in power-constrained servers. In particular, the design space, composed of all design parameters, is split into smaller independent sub-spaces. Each design sub-space is assigned to a particular agent so that it can explore it faster, yet accurately. The benefits of our approach are revealed in terms of adaptability and quality (with up to 4× improvements in QoS compared to a static resource management scheme) and learning time (6× faster than an equivalent mono-agent implementation). Finally, we show that the formulated power-capping techniques outperform hardware-based power capping with respect to quality.

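    Gestión de recursos energéticamente eficiente para aplicaciones paralelas basadas en tareas en entornos multi-aplicación [Energy-efficient resource management for task-based parallel applications in multi-application environments]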

    Thesis of the Universidad Complutense de Madrid, Facultad de Informática, defended on 28/01/2021. The end of Dennard scaling, as well as the arrival of the post-Moore era, has meant a big change in the way performance and energy efficiency are achieved by modern processors. From a constant increase of the clock frequency as the main method to increase performance at the beginning of the 2000s, the increase in the number of cores inside processors running at relatively conservative frequencies has become established as the current trend to increase both performance and energy efficiency. This has been accompanied by an increase in the heterogeneity of systems, both inside the processors, which comprise different types of cores (e.g., big.LITTLE architectures) or add specific compute units (like multimedia extensions), and in the platform, through the addition of other specific compute units (like GPUs), offering different performance and energy-efficiency trade-offs. Together with the increase in the number of cores, the processor evolution has been accompanied by the addition of different technologies that allow processors to adapt dynamically to changes in the environment and the running applications. Among others, techniques like dynamic voltage and frequency scaling, power capping, or cache partitioning are widely used nowadays to increase performance and/or energy efficiency...

    Remote Sensing Data Compression

    A huge amount of data is acquired nowadays by different remote sensing systems installed on satellites, aircraft, and UAVs. The acquired data then have to be transferred to image processing centres, stored, and/or delivered to customers. In restricted scenarios, data compression is strongly desired or necessary. A wide diversity of coding methods can be used, depending on the requirements and their priority. In addition, the types and properties of images differ a lot; thus, practical implementation aspects have to be taken into account. The Special Issue paper collection taken as the basis of this book touches on all of the aforementioned items to some degree, giving the reader an opportunity to learn about recent developments and research directions in the field of image compression. In particular, lossless and near-lossless compression of multi- and hyperspectral images remains topical, since such images constitute extremely large data arrays with rich information that can be retrieved from them for various applications. Another important aspect is the impact of lossless compression on image classification and segmentation, where a reasonable compromise between the characteristics of compression and the final tasks of data processing has to be achieved. The problems of data transition from UAV-based acquisition platforms, as well as the use of FPGA and neural networks, have become very important. Finally, attempts to apply compressive sensing approaches in remote sensing image processing with positive outcomes are observed. We hope that readers will find our book useful and interesting.

    AVX2-optimized Kvazaar HEVC intra encoder

    This paper presents efficient SIMD optimizations for the open-source Kvazaar HEVC intra encoder. The C implementation of Kvazaar is accelerated with Intel AVX2 instructions, and their effect on the Kvazaar ultrafast preset is profiled. According to our profiling results, the C functions of SATD, DCT, quantization, and intra prediction account for over 60% of the total intra coding time of the Kvazaar ultrafast preset. This work shows that optimizing primarily these functions doubles the coding speed of a single-threaded Kvazaar intra encoder for the same rate-distortion performance. The highest performance boost is obtained by deploying the proposed optimizations jointly with multithreading. On an 8-core Intel i7 processor, the AVX2-optimized 16-threaded Kvazaar ultrafast preset achieves real-time (30 fps) intra coding speed up to 1080p resolution. Compared to the AVX2-optimized ultrafast preset of x265, Kvazaar is 20% faster and still obtains a 9.1% bit rate gain for the same quality. These results indicate that Kvazaar is currently the leading open-source HEVC intra encoder in terms of real-time coding speed and efficiency.