296 research outputs found

    A highly scalable parallel implementation of H.264

    Get PDF
    Developing parallel applications that can harness and efficiently use future many-core architectures is the key challenge for scalable computing systems. We contribute to this challenge by presenting a parallel implementation of H.264 that scales to a large number of cores. The algorithm exploits the fact that independent macroblocks (MBs) can be processed in parallel, but whereas a previous approach exploits only intra-frame MB-level parallelism, our algorithm exploits intra-frame as well as inter-frame MB-level parallelism. It is based on the observation that inter-frame dependencies have a limited spatial range. The algorithm has been implemented on a many-core architecture consisting of NXP TriMedia TM3270 embedded processors. This required to develop a subscription mechanism, where MBs are subscribed to the kick-off lists associated with the reference MBs. Extensive simulation results show that the implementation scales very well, achieving a speedup of more than 54 on a 64-core processor, in which case the previous approach achieves a speedup of only 23. Potential drawbacks of the 3D-Wave strategy are that the memory requirements increase since there can be many frames in flight, and that the frame latency might increase. Scheduling policies to address these drawbacks are also presented. The results show that these policies combat memory and latency issues with a negligible effect on the performance scalability. Results analyzing the impact of the memory latency, L1 cache size, and the synchronization and thread management overhead are also presented. Finally, we present performance requirements for entropy (CABAC) decoding. This work was performed while the fourth author was with NXP Semiconductors.Peer ReviewedPostprint (author's final draft

    Parallel scalability of video decoders

    No full text
    An important question is whether emerging and future applications exhibit sufficient parallelism, in particular thread-level parallelism, to exploit the large numbers of cores future chip multiprocessors (CMPs) are expected to contain. As a case study we investigate the parallelism available in video decoders, an important application domain now and in the future. Specifically, we analyze the parallel scalability of the H.264 decoding process. First we discuss the data structures and dependencies of H.264 and show what types of parallelism it allows to be exploited. We also show that previously proposed parallelization strategies such as slice-level, frame-level, and intra-frame macroblock (MB) level parallelism, are not sufficiently scalable. Based on the observation that inter-frame dependencies have a limited spatial range we propose a new parallelization strategy, called Dynamic 3D-Wave. It allows certain MBs of consecutive frames to be decoded in parallel. Using this new strategy we analyze the limits to the available MB-level parallelism in H.264. Using real movie sequences we find a maximum MB parallelism ranging from 4000 to 7000. We also perform a case study to assess the practical value and possibilities of a highly parallelized H.264 application. The results show that H.264 exhibits sufficient parallelism to efficiently exploit the capabilities of future manycore CMPs.Peer ReviewedPostprint (published version

    Video Coding Performance

    Get PDF

    Performance study of synthetic AER generation on CPUs for Real-Time Video based on Spikes

    Get PDF
    Address-Event-Representation (AER) is a neuromorphic interchip communication protocol that allows for real-time virtual massive connectivity between huge number neurons located on different chips. When building multi-chip muti-layered AER systems it is absolutely necessary to have a computer interface that allows (a) to read AER interchip traffic into the computer and visualize it on screen, and (b) convert conventional frame-based video stream in the computer into AER and inject it at some point of the AER structure. This is necessary for test and debugging of complex AER systems. Previous work presented several software methods for converting digital frames into AER format. Those methods were not feasible for real-time conversion those days because the processor performance was insufficient. Nowadays, Multi-core processor architectures and cache hierarchies have evolved and the performance is much better than Pentium 4 Mobile of those years. In this paper we study frame-to-AER methods for realtime video applications (40ms per frame) using modern processor architectures, compilers, and processors oriented for stand-alone applications (mini-PC processors

    Fine-Grain Parallelism

    Get PDF
    Computer hardware is at the beginning of the multi-core revolution. While hardware at the commodity level is capable of running concurrent software, most software does not take advantage of this fact because parallel software development is difficult. This project addressed potential remedies to these difficulties by investigating graphical programming and fine-grain parallelism. A prototype system taking advantage of both of these concepts was implemented and evaluated in terms of real-world applications

    Media gateway utilizando um GPU

    Get PDF
    Mestrado em Engenharia de Computadores e Telemátic

    Investigation of parallel programming on heterogeneous multiprocessors

    Get PDF
    Multi-core processors have become ordinary in modern commodity computers. Computationally intensive applications, like video processing, that previously only ran on specialized hardware, are now common on home computers. However, the demand for more computing power is ever-increasing, and with the introduction of high definition video, more performance is desired. As an alternative to having multiple identical processor cores, heterogeneous multiprocessors have cores with different capabilities. This allows tasks to be processed on simple cores with specialized functionality. The simplicity furthers low power consumption, small die usage, and low price. Dealing with heterogeneous cores increases the complexity of writing programs for the architecture. The reasons for this includes different capabilities of the cores, and some heterogeneous architectures do not have shared memory. Without shared memory, accessing main memory requires explicit transfers to local memory. In this thesis, we consider two architectures, the STI Cell/B.E. and Intel IXP2400, and evaluate parallelization strategies and performance for real-world problems. Our tests show promising throughput for some applications, and we propose a scheme for offloading computationally intensive parts of an existing application

    H.264/AVC inter prediction on accelerator-based multi-core systems

    Get PDF
    The AVC video coding standard adopts variable block sizes for inter frame coding to increase compression efficiency, among other new features. As a consequence of this, an AVC encoder has to employ a complex mode decision technique that requires high computational complexity. Several techniques aimed at accelerating the inter prediction process have been proposed in the literature in recent years. Recently, with the emergence of many-core processors or accelerators, a new way of supporting inter frame prediction has presented itself. In this paper, we present a step forward in the implementation of an AVC inter prediction algorithm in a graphics processing unit, using Compute Unified Device Architecture. The results show a negligible drop in rate distortion with a time reduction, on average, of over 98.8 % compared with full search and fast full search, and of over 80 % compared with UMHexagonS search

    Processamento de imagens médicas usando GPU

    Get PDF
    Mestrado em Engenharia de Computadores e TelemáticaA aplicação CapView utiliza um algoritmo de classificação baseado em SVM (Support Vector Machines) para automatizar a segmentação topográfica de vídeos do trato intestinal obtidos por cápsula endoscópica. Este trabalho explora a aplicação de processadores gráficos (GPU) para execução paralela desse algoritmo. Após uma etapa de otimização da versão sequencial, comparou-se o desempenho obtido por duas abordagens: (1) desenvolvimento apenas do código do lado do host, com suporte em bibliotecas especializadas para a GPU, e (2) desenvolvimento de todo o código, incluindo o que é executado no GPU. Ambas permitiram ganhos (speedups) significativos, entre 1,4 e 7 em testes efetuados com GPUs individuais de vários modelos. Usando um cluster de 4 GPU do modelo de maior capacidade, conseguiu-se, em todos os casos testados, ganhos entre 26,2 e 27,2 em relação à versão sequencial otimizada. Os métodos desenvolvidos foram integrados na aplicação CapView, utilizada em rotina em ambientes hospitalares.The CapView application uses a classification algorithm based on SVMs (Support Vector Machines) for automatic topographic segmentation of gastrointestinal tract videos obtained through capsule endoscopy. This work explores the use graphic processors (GPUs) to parallelize the segmentation algorithm. After an optimization phase of the sequential version, two new approaches were analyzed: (1) development of the host code only, with support of specialized libraries for the GPU, and (2) development of the host and the device’s code. The two approaches caused substantial gains, with speedups between 1.4 and 7 times in tests made with several different individual GPUs. In a cluster of 4 GPUs of the most capable model, speedups between 26.2 and 27.2 times were achieved, compared to the optimized sequential version. The methods developed were integrated in the CapView application, used in routine in medical environments
    corecore