
    Adaptive Lightweight Compression Acceleration on Hybrid CPU-FPGA System


    Optimization of the motion estimation for parallel embedded systems in the context of new video standards

    15 pages. International audience. The efficiency of video compression methods mainly depends on the motion compensation stage, and the design of efficient motion estimation techniques is still an important issue. A highly accurate motion estimation can significantly reduce the bit-rate, but involves a high computational complexity. This is particularly true for the new generations of video compression standards, MPEG AVC and HEVC, which involve techniques such as multiple reference frames, sub-pixel estimation, and variable block sizes. In this context, the design of fast motion estimation solutions is necessary and concerns two linked aspects: a high-quality algorithm and its efficient implementation. This paper summarizes our main contributions in this domain. In particular, we first present the HME (Hierarchical Motion Estimation) technique. It is based on a multi-level refinement process where the motion vectors are first estimated on a sub-sampled image. The multi-level decomposition provides robust predictions and is particularly suited to variable-block-size motion estimation. The HME method has been integrated in an AVC encoder, and we propose a parallel implementation of this technique, with the motion estimation at pixel level performed by a DSP processor and the sub-pixel refinement realized in an FPGA. The second technique that we present is called HDS, for Hierarchical Diamond Search. It combines the multi-level refinement of HME with a fast search at pixel accuracy inspired by the EPZS method. This paper also presents its parallel implementation on a multi-DSP platform and its use in the HEVC context.
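
    The coarse-to-fine idea behind hierarchical motion estimation can be illustrated with a minimal block-matching sketch: a motion vector is first estimated on a down-sampled image, then scaled up and refined in a small window at full resolution. This is an illustrative sketch only (single-channel frames, SAD cost, one refinement level, hypothetical names); it is not the authors' DSP/FPGA implementation.

```cpp
// Illustrative coarse-to-fine block motion estimation (hypothetical names,
// not the paper's HME implementation). Frames are 8-bit grayscale.
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

struct Frame { int w, h; std::vector<uint8_t> px;
               uint8_t at(int x, int y) const { return px[size_t(y) * w + x]; } };
struct MV { int dx, dy; };

// Sum of absolute differences between a block in cur and a displaced block in ref.
static long sad(const Frame& cur, const Frame& ref, int bx, int by, int bs, int dx, int dy) {
    long s = 0;
    for (int y = 0; y < bs; ++y)
        for (int x = 0; x < bs; ++x) {
            int rx = bx + x + dx, ry = by + y + dy;
            if (rx < 0 || ry < 0 || rx >= ref.w || ry >= ref.h)
                return std::numeric_limits<long>::max();
            s += std::abs(int(cur.at(bx + x, by + y)) - int(ref.at(rx, ry)));
        }
    return s;
}

// Full search of +/-range around a predicted vector.
static MV search(const Frame& cur, const Frame& ref, int bx, int by, int bs, MV pred, int range) {
    MV best = pred; long bestCost = std::numeric_limits<long>::max();
    for (int dy = -range; dy <= range; ++dy)
        for (int dx = -range; dx <= range; ++dx) {
            long c = sad(cur, ref, bx, by, bs, pred.dx + dx, pred.dy + dy);
            if (c < bestCost) { bestCost = c; best = {pred.dx + dx, pred.dy + dy}; }
        }
    return best;
}

// 2:1 down-sampling by averaging 2x2 neighbourhoods.
static Frame downsample(const Frame& f) {
    Frame d{f.w / 2, f.h / 2, {}};
    d.px.resize(size_t(d.w) * d.h);
    for (int y = 0; y < d.h; ++y)
        for (int x = 0; x < d.w; ++x)
            d.px[size_t(y) * d.w + x] = uint8_t((f.at(2*x, 2*y) + f.at(2*x+1, 2*y)
                                               + f.at(2*x, 2*y+1) + f.at(2*x+1, 2*y+1)) / 4);
    return d;
}

// Coarse estimate on the sub-sampled level, then refinement at full resolution.
MV hierarchicalME(const Frame& cur, const Frame& ref, int bx, int by, int bs) {
    Frame curL = downsample(cur), refL = downsample(ref);
    MV coarse = search(curL, refL, bx / 2, by / 2, bs / 2, {0, 0}, 8); // coarse level
    MV pred = {coarse.dx * 2, coarse.dy * 2};                          // scale vector up
    return search(cur, ref, bx, by, bs, pred, 2);                      // full-resolution refinement
}
```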

    Performance and area evaluations of processor-based benchmarks on FPGA devices

    Computing systems on SoCs have been a long-term research topic since FPGA technology emerged, owing to its re-programmable fabric, reconfigurable computing, and fast development time to market. During the last decade, a single processor in a SoC has no longer been able to cope with the fast-growing market for complex applications such as mobile-phone audio and video encoding, image processing, and network processing. Because the number of transistors on a silicon wafer keeps increasing, recent FPGAs and embedded systems are moving toward multi-processor-based designs to meet tremendous performance demands, and such systems have become practical; this is the coming age of the MPSoC. In addition, most embedded processors are soft cores, because they are flexible and reconfigurable for specific software functions and make it easy to build homogeneous multi-processor systems for parallel programming. Moreover, behavioural synthesis tools are becoming far more powerful; they can create datapaths of logic units from high-level algorithms (e.g. C to HDL) and support partitioning in a HW/SW co-design methodology. A range of embedded processors can be implemented on an FPGA-based prototype to integrate the CPUs on a programmable device. This research first presents the different computer architectures used in modern embedded processors and the types of software applications they target (e.g. multi-threading operations or complex functions) on FPGA-based SoCs; secondly, it investigates their capability by executing a wide range of multimedia software codes (integer arithmetic only) on different processor-system models (uni-processor, multi-processor, or co-design); finally, it compares the results in terms of benchmark performance and resource utilization within FPGAs. All the examined programs were written in standard C and executed on varying numbers of soft-core processors or hardware units to obtain execution times. However, the number of processors, their customizable configuration, and the hardware datapaths that can be generated are limited by the target FPGA's resources, and designers need to understand the FPGA-based trade-off that has to be considered: speed versus area. For this experimental purpose, I divided the benchmarks into DLP and HLS categories, which are "data" and "function" intensive respectively. The DLP programs are executed on the LEON3 MP and LE1 CMP multi-processor systems, and the HLS programs on the LegUp co-design system, on the target FPGAs. As a preliminary step, the performance of the soft-core processors is examined by executing all the benchmarks. The whole thesis centres on execution times, speed-up, and area breakdown on FPGA devices for the different programs.
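
    The speed-versus-area comparison the thesis centres on comes down to two simple ratios: speed-up relative to a baseline execution time, and the fraction of FPGA resources a configuration consumes. A minimal sketch of that bookkeeping follows; all figures and field names are hypothetical placeholders, not results from the thesis.

```cpp
// Hypothetical speed/area bookkeeping for comparing soft-core configurations.
#include <cstdio>

struct Result {
    const char* config;     // e.g. "multi-processor x4" (placeholder label)
    double execTimeMs;      // measured execution time of a benchmark
    int lutsUsed;           // FPGA LUTs consumed by the configuration
    int lutsAvailable;      // LUTs available on the target device
};

int main() {
    // Placeholder numbers for illustration only.
    Result baseline{"uni-processor", 120.0, 8000, 60000};
    Result candidates[] = {
        {"multi-processor x4", 40.0, 30000, 60000},
        {"HW/SW co-design",    15.0, 22000, 60000},
    };
    for (const Result& r : candidates) {
        double speedup = baseline.execTimeMs / r.execTimeMs;
        double area    = 100.0 * r.lutsUsed / r.lutsAvailable;
        // Speed-up per unit of area gives one rough way to rank the trade-off.
        std::printf("%-20s speed-up %.2fx, area %.1f%% of LUTs, ratio %.3f\n",
                    r.config, speedup, area, speedup / (area / 100.0));
    }
    return 0;
}
```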

    Computing and Compressing Electron Repulsion Integrals on FPGAs

    The computation of electron repulsion integrals (ERIs) over Gaussian-type orbitals (GTOs) is a challenging problem in quantum-mechanics-based atomistic simulations. In practical simulations, several trillion ERIs may have to be computed for every time step. In this work, we investigate FPGAs as accelerators for the ERI computation. We use template parameters, here within the Intel oneAPI tool flow, to create customized designs for 256 different ERI quartet classes, based on their orbitals. To maximize data reuse, all intermediates are buffered in FPGA on-chip memory with a customized layout. The pre-calculation of intermediates also helps to overcome data dependencies caused by multi-dimensional recurrence relations. The involved loop structures are partially or even fully unrolled for high throughput of the FPGA kernels. Furthermore, a lossy compression algorithm utilizing arbitrary-bitwidth integers is integrated in the FPGA kernels. To the best of our knowledge, this is the first work on ERI computation on FPGAs that supports more than just the single most basic quartet class. Also, the integration of ERI computation and compression is a novelty that is not covered by CPU or GPU libraries so far. Our evaluation shows that, using 16-bit integers for the ERI compression, the fastest FPGA kernels exceed a performance of 10 GERIS ($10 \times 10^9$ ERIs per second) on one Intel Stratix 10 GX 2800 FPGA, with maximum absolute errors around $10^{-7}$-$10^{-5}$ Hartree. The measured throughput can be accurately explained by a performance model. The FPGA kernels deployed on 2 FPGAs outperform similar computations using the widely used libint reference on a two-socket server with 40 Xeon Gold 6148 CPU cores of the same process technology by factors of up to 6.0x, and on a newer two-socket server with 128 EPYC 7713 CPU cores by up to 1.9x.
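
    The lossy-compression idea (storing ERI values as narrow integers instead of doubles) can be sketched as block-scaled fixed-point quantization: each value is scaled, rounded to a 16-bit integer, and rescaled on decompression, with the maximum absolute error bounded by half a quantization step. This is a generic sketch of that kind of scheme under an assumed per-block scale factor, not the compression algorithm used in the paper.

```cpp
// Generic 16-bit block-scaled quantization sketch (not the paper's algorithm).
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Compressed {
    double scale;              // multiplier used during quantization
    std::vector<int16_t> q;    // quantized values
};

// Scale the block so its largest magnitude maps onto the int16 range, then round.
// The maximum absolute reconstruction error is 0.5 / scale.
Compressed compress(const std::vector<double>& v) {
    double maxAbs = 0.0;
    for (double x : v) maxAbs = std::max(maxAbs, std::abs(x));
    double scale = maxAbs > 0.0 ? 32767.0 / maxAbs : 1.0;
    Compressed c{scale, std::vector<int16_t>(v.size())};
    for (size_t i = 0; i < v.size(); ++i)
        c.q[i] = static_cast<int16_t>(std::lround(v[i] * scale));
    return c;
}

std::vector<double> decompress(const Compressed& c) {
    std::vector<double> v(c.q.size());
    for (size_t i = 0; i < c.q.size(); ++i) v[i] = c.q[i] / c.scale;
    return v;
}

int main() {
    std::vector<double> eri = {0.0123456, -0.0047319, 0.2500001};  // made-up values
    Compressed c = compress(eri);
    std::vector<double> rec = decompress(c);
    for (size_t i = 0; i < eri.size(); ++i)
        std::printf("orig % .7f  rec % .7f  err %.2e\n",
                    eri[i], rec[i], std::abs(eri[i] - rec[i]));
    return 0;
}
```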

    Hardware accelerated real-time Linux video anonymizer

    Master's dissertation in Industrial Electronics and Computer Engineering (Dissertação de mestrado em Engenharia Eletrónica Industrial e Computadores). Embedded systems are currently present in a wide range of everyday equipment, from TV boxes, televisions and routers to the indispensable smartphone. The Linux operating system, with its "one-size-fits-all" distribution philosophy, has become a viable alternative, providing extensive hardware support, debugging techniques, network communication protocols, and other functionalities that have become the standard set of requirements in most modern embedded systems. This operating system is appealing due to its open-source philosophy, which provides the user with a vast set of software libraries that speed up development in a given domain and ease the integration of complex software. Machine Learning algorithms are developed to execute tasks autonomously, i.e., without human supervision, and are present in the most varied technologies, from the image-focus system of a smartphone to the lane-boundary detection system of an autonomous driving system. When compiled for embedded platforms, these algorithms demand processing effort and resources, such as memory footprint, that in most cases far exceed what is available to the system's application, requiring the components that need greater processing power to be implemented in hardware so that the timing requirements are met.
    This dissertation proposes a video anonymization system that acquires, processes, and manipulates frames in order to guarantee anonymity, even during transmission. Its implementation includes object detection techniques and combines two hardware-acceleration approaches: parallelization and execution on specialized hardware. The proposed implementation is constrained both in time and in resource consumption at the hardware and software levels.
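
    The anonymization step itself reduces to obscuring the pixels inside each detected bounding box before a frame leaves the device. A minimal sketch of that idea, pixelating hypothetical detection rectangles in a grayscale frame, is shown below; the detector, real-time constraints, and hardware offload described in the dissertation are out of scope here.

```cpp
// Pixelate detected regions of a grayscale frame (illustrative sketch only;
// detections would come from an object-detection model in the real system).
#include <algorithm>
#include <cstdint>
#include <vector>

struct Rect { int x, y, w, h; };   // detected region, e.g. a face bounding box

struct Frame {
    int w, h;
    std::vector<uint8_t> px;       // row-major, 8-bit grayscale
    uint8_t& at(int x, int y) { return px[size_t(y) * w + x]; }
};

// Replace each block x block tile inside the rectangle with its average value,
// which removes identifying detail while keeping the frame layout intact.
void pixelate(Frame& f, const Rect& r, int block = 16) {
    int x0 = std::max(0, r.x), y0 = std::max(0, r.y);
    int x1 = std::min(f.w, r.x + r.w), y1 = std::min(f.h, r.y + r.h);
    for (int by = y0; by < y1; by += block)
        for (int bx = x0; bx < x1; bx += block) {
            int ex = std::min(bx + block, x1), ey = std::min(by + block, y1);
            long sum = 0; int n = 0;
            for (int y = by; y < ey; ++y)
                for (int x = bx; x < ex; ++x) { sum += f.at(x, y); ++n; }
            uint8_t avg = static_cast<uint8_t>(sum / n);
            for (int y = by; y < ey; ++y)
                for (int x = bx; x < ex; ++x) f.at(x, y) = avg;
        }
}
```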

    FPGA ACCELERATION OF A CORTICAL AND A MATCHED FILTER-BASED ALGORITHM

    Digital image processing is a widely used and diverse field. It is applied in a broad array of areas such as tracking and detection, object avoidance, computer vision, and numerous other applications. For many image processing tasks, the computations can become time consuming, so a means of accelerating them would be beneficial. Using that as motivation, this thesis examines the acceleration of two distinctly different image processing applications. The first is a recent neocortex-inspired cognitive model geared towards pattern recognition as seen in the visual cortex. For this model, both software and reconfigurable-logic-based FPGA implementations are examined on a Cray XD1. Results indicate that hardware acceleration can provide average throughput gains of 75 times over software-only implementations of the networks examined when utilizing the full resources of the Cray XD1. The second application is matched-filter-based position detection. This approach is at the heart of the automatic alignment algorithm being tested for the National Ignition Facility, presently under construction at the Lawrence Livermore National Laboratory. To reduce the processing time of the matched filtering, a reconfigurable logic architecture was developed. Results show that this architecture provides a speedup of approximately 253 times over an optimized software implementation.
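
    Matched-filter-based position detection amounts to cross-correlating the image with a template of the feature being aligned and taking the location of the correlation peak as the detected position. A minimal spatial-domain sketch of that idea follows; the NIF alignment algorithm and the FPGA architecture in the thesis are far more elaborate, and all names and types here are illustrative.

```cpp
// Spatial-domain matched filtering: slide a template over the image and
// report the offset with the highest normalized correlation (illustrative only).
#include <cmath>
#include <cstddef>
#include <vector>

struct Image {
    int w, h;
    std::vector<float> px;                       // row-major intensities
    float at(int x, int y) const { return px[std::size_t(y) * w + x]; }
};

struct Peak { int x, y; double score; };

// Returns the top-left offset where the template matches best.
Peak matchedFilter(const Image& img, const Image& tmpl) {
    Peak best{0, 0, -1.0};
    for (int oy = 0; oy + tmpl.h <= img.h; ++oy)
        for (int ox = 0; ox + tmpl.w <= img.w; ++ox) {
            double dot = 0.0, energy = 0.0;
            for (int y = 0; y < tmpl.h; ++y)
                for (int x = 0; x < tmpl.w; ++x) {
                    double v = img.at(ox + x, oy + y);
                    dot    += v * tmpl.at(x, y);
                    energy += v * v;
                }
            // Normalizing by the local energy keeps bright regions from
            // dominating the score.
            double score = energy > 0.0 ? dot / std::sqrt(energy) : 0.0;
            if (score > best.score) best = {ox, oy, score};
        }
    return best;
}
```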