692 research outputs found

    Efficient Parallel Video Encoding on Heterogeneous Systems

    Get PDF
    Proceedings of: First International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2014). Porto (Portugal), August 27-28, 2014.In this study we propose an efficient method for collaborative H.264/AVC inter-loop encoding in heterogeneous CPU+GPU systems. This method relies on specifically developed extensive library of highly optimized parallel algorithms for both CPU and GPU architectures, and all inter-loop modules. In order to minimize the overall encoding time, this method integrates adaptive load balancing for the most computationally intensive, inter-prediction modules, which is based on dynamically built functional performance models of heterogenous devices and inter-loop modules. The proposed method also introduces efficient communication-aware techniques, which maximize data reusing, and decrease the overhead of expensive data transfers in collaborative video encoding. The experimental results show that the proposed method is able of achieving real-time video encoding for very demanding video coding parameters, i.e., full HD video format, 64×64 pixels search area and the exhaustive motion estimation.This work was supported by national funds through FCT – Fundação para a CiĂȘncia e a Tecnologia, under projects PEst-OE/EEI/LA0021/2013, PTDC/EEI-ELC/3152/2012 and PTDC/EEA-ELC/117329/2010

    High Performance Multiview Video Coding

    Get PDF
    Following the standardization of the latest video coding standard High Efficiency Video Coding in 2013, in 2014, multiview extension of HEVC (MV-HEVC) was published and brought significantly better compression performance of around 50% for multiview and 3D videos compared to multiple independent single-view HEVC coding. However, the extremely high computational complexity of MV-HEVC demands significant optimization of the encoder. To tackle this problem, this work investigates the possibilities of using modern parallel computing platforms and tools such as single-instruction-multiple-data (SIMD) instructions, multi-core CPU, massively parallel GPU, and computer cluster to significantly enhance the MVC encoder performance. The aforementioned computing tools have very different computing characteristics and misuse of the tools may result in poor performance improvement and sometimes even reduction. To achieve the best possible encoding performance from modern computing tools, different levels of parallelism inside a typical MVC encoder are identified and analyzed. Novel optimization techniques at various levels of abstraction are proposed, non-aggregation massively parallel motion estimation (ME) and disparity estimation (DE) in prediction unit (PU), fractional and bi-directional ME/DE acceleration through SIMD, quantization parameter (QP)-based early termination for coding tree unit (CTU), optimized resource-scheduled wave-front parallel processing for CTU, and workload balanced, cluster-based multiple-view parallel are proposed. The result shows proposed parallel optimization techniques, with insignificant loss to coding efficiency, significantly improves the execution time performance. This , in turn, proves modern parallel computing platforms, with appropriate platform-specific algorithm design, are valuable tools for improving the performance of computationally intensive applications

    Homogeneous and heterogeneous MPSoC architectures with network-on-chip connectivity for low-power and real-time multimedia signal processing

    Get PDF
    Two multiprocessor system-on-chip (MPSoC) architectures are proposed and compared in the paper with reference to audio and video processing applications. One architecture exploits a homogeneous topology; it consists of 8 identical tiles, each made of a 32-bit RISC core enhanced by a 64-bit DSP coprocessor with local memory. The other MPSoC architecture exploits a heterogeneous-tile topology with on-chip distributed memory resources; the tiles act as application specific processors supporting a different class of algorithms. In both architectures, the multiple tiles are interconnected by a network-on-chip (NoC) infrastructure, through network interfaces and routers, which allows parallel operations of the multiple tiles. The functional performances and the implementation complexity of the NoC-based MPSoC architectures are assessed by synthesis results in submicron CMOS technology. Among the large set of supported algorithms, two case studies are considered: the real-time implementation of an H.264/MPEG AVC video codec and of a low-distortion digital audio amplifier. The heterogeneous architecture ensures a higher power efficiency and a smaller area occupation and is more suited for low-power multimedia processing, such as in mobile devices. The homogeneous scheme allows for a higher flexibility and easier system scalability and is more suited for general-purpose DSP tasks in power-supplied devices

    Video Coding Performance

    Get PDF

    Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)

    Get PDF
    Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016) Timisoara, Romania. February 8-11, 2016.The PhD Symposium was a very good opportunity for the young researchers to share information and knowledge, to present their current research, and to discuss topics with other students in order to look for synergies and common research topics. The idea was very successful and the assessment made by the PhD Student was very good. It also helped to achieve one of the major goals of the NESUS Action: to establish an open European research network targeting sustainable solutions for ultrascale computing aiming at cross fertilization among HPC, large scale distributed systems, and big data management, training, contributing to glue disparate researchers working across different areas and provide a meeting ground for researchers in these separate areas to exchange ideas, to identify synergies, and to pursue common activities in research topics such as sustainable software solutions (applications and system software stack), data management, energy efficiency, and resilience.European Cooperation in Science and Technology. COS

    Doctor of Philosophy

    Get PDF
    dissertationAs the base of the software stack, system-level software is expected to provide ecient and scalable storage, communication, security and resource management functionalities. However, there are many computationally expensive functionalities at the system level, such as encryption, packet inspection, and error correction. All of these require substantial computing power. What's more, today's application workloads have entered gigabyte and terabyte scales, which demand even more computing power. To solve the rapidly increased computing power demand at the system level, this dissertation proposes using parallel graphics pro- cessing units (GPUs) in system software. GPUs excel at parallel computing, and also have a much faster development trend in parallel performance than central processing units (CPUs). However, system-level software has been originally designed to be latency-oriented. GPUs are designed for long-running computation and large-scale data processing, which are throughput-oriented. Such mismatch makes it dicult to t the system-level software with the GPUs. This dissertation presents generic principles of system-level GPU computing developed during the process of creating our two general frameworks for integrating GPU computing in storage and network packet processing. The principles are generic design techniques and abstractions to deal with common system-level GPU computing challenges. Those principles have been evaluated in concrete cases including storage and network packet processing applications that have been augmented with GPU computing. The signicant performance improvement found in the evaluation shows the eectiveness and eciency of the proposed techniques and abstractions. This dissertation also presents a literature survey of the relatively young system-level GPU computing area, to introduce the state of the art in both applications and techniques, and also their future potentials

    Exploiting multiple levels of parallelism of Convergent Cross Mapping

    Get PDF
    Identifying causal relationships between variables remains an essential problem across various scientific fields. Such identification is particularly important but challenging in complex systems, such as those involving human behaviour, sociotechnical contexts, and natural ecosystems. By exploiting state space reconstruction via lagged embeddings of time series, convergent cross mapping (CCM) serves as an important method for addressing this problem. While powerful, CCM is computationally costly; moreover, CCM results are highly sensitive to several parameter values. Current best practice involves performing a systematic search on a range of parameters, but results in high computational burden, which mainly raises barriers to practical use. In light of both such challenges and the growing size of commonly encountered datasets from complex systems, inferring the causality with confidence using CCM in a reasonable time becomes a biggest challenge. In this thesis, I investigate the performance associated with a variety of parallel techniques (CUDA, Thrust, OpenMP, MPI and Spark, etc.,) to accelerate convergent cross mapping. The performance of each method was collected and compared across multiple experiments to further evaluate potential bottlenecks. Moreover, the work deployed and tested combinations of these techniques to more thoroughly exploit available computation resources. The results obtained from these experiments indicate that GPUs can only accelerate the CCM algorithm under certain circumstances and requirements. Otherwise, the overhead of data transfer and communication can become the limiting bottleneck. On the other hand, in cluster computing, the MPI/OpenMP framework outperforms the Spark framework by more than one order of magnitude in terms of processing speed and provides more consistent performance for distributed computing. This also reflects the large size of the output from the CCM algorithm. However, Spark shows better cluster infrastructure management, ease of software engineering, and more ready handling of other aspects, such as node failure and data replication. Furthermore, combinations of GPU and cluster frameworks are deployed and compared in GPU/CPU clusters. An apparent speedup can be achieved in the Spark framework, while extra time cost is incurred in the MPI/OpenMP framework. The underlying reason reflects the fact that the code complexity imposed by GPU utilization cannot be readily offset in the MPI/OpenMP framework. Overall, the experimental results on parallelized solutions have demonstrated a capacity for over an order of magnitude performance improvement when compared with the widely used current library rEDM. Such economies in computation time can speed learning and robust identification of causal drivers in complex systems. I conclude that these parallel techniques can achieve significant improvements. However, the performance gain varies among different techniques or frameworks. Although the use of GPUs can accelerate the application, there still exists constraints required to be taken into consideration, especially with regards to the input data scale. Without proper usage, GPUs use can even slow down the whole execution time. Convergent cross mapping can achieve a maximum speedup by adopting the MPI/OpenMP framework, as it is suitable to computation-intensive algorithms. By contrast, the Spark framework with integrated GPU accelerators still offers low execution cost comparing to the pure Spark version, which mainly fits in data-intensive problems
    • 

    corecore