8 research outputs found

    Zynq-Based Reconfigurable System for Real-Time Edge Detection of Noisy Video Sequences

    Get PDF
    We implement Zynq-based self-reconfigurable system to perform real-time edge detection of 1080p video sequences. While object edge detection is a fundamental tool in computer vision, noises in the video frames negatively affect edge detection results significantly. Moreover, due to the high computational complexity of 1080p video filtering operations, hardware implementation on reconfigurable hardware fabric is necessary. Here, the proposed embedded system utilizes dynamic reconfiguration capability of Zynq SoC so that partial reconfiguration of different filter bitstreams is performed during run-time according to the detected noise density level in the incoming video frames. Pratt’s Figure of Merit (PFOM) to evaluate the accuracy of edge detection is analyzed for various noise density levels, and we demonstrate that adaptive run-time reconfiguration of the proposed filter bitstreams significantly increases the accuracy of edge detection results while efficiently providing computing power to support real-time processing of 1080p video frames. Performance results on configuration time, CPU usage, and hardware resource utilization are also compared

    Computing Sorted Subsets for Data Processing in Communicating Software/Hardware Control Systems

    Get PDF
    Computing and filtering sorted subsets are frequently required in statistical data manipulation and control applications. The main objective is to extract subsets from large data sets in accordance with some criteria, for example, with the maximum and/or the minimum values in the entire set or within the predefined constraints. The paper suggests a new computation method enabling the indicated above problem to be solved in all programmable systems-on-chip from the Xilinx Zynq family that combine a dual-core Cortex-A9 processing unit and programmable logic linked by high-performance interfaces. The method involves highly parallel sorting networks and run-time filtering. The computations are done in communicating software, running in the processing unit, and hardware, implemented in the programmable logic. Practical applications of the proposed technique are also shown. The results of implementation and experiments clearly demonstrate significant speed-up of the developed software/hardware system comparing to alternative software implementations

    Data processing in high-performance computing systems

    Get PDF
    The paper integrates the results of a large group of authors working in different areas that are important in the scope of big data, including but not limited to: overview of the basic solutions for the development of data centers; storage and processing; decomposition of a problem into sub-problems of lower complexity (such as applying divide and conquer algorithms); models and methods allowing broad parallelism to be realized; alternative techniques for potential acceleration; programming languages; and practical applications

    An integrated hardware/software design methodology for signal processing systems

    Get PDF
    This paper presents a new methodology for design and implementation of signal processing systems on system-on-chip (SoC) platforms. The methodology is centered on the use of lightweight application programming interfaces for applying principles of dataflow design at different layers of abstraction. The development processes integrated in our approach are software implementation, hardware implementation, hardware-software co-design, and optimized application mapping. The proposed methodology facilitates development and integration of signal processing hardware and software modules that involve heterogeneous programming languages and platforms. As a demonstration of the proposed design framework, we present a dataflow-based deep neural network (DNN) implementation for vehicle classification that is streamlined for real-time operation on embedded SoC devices. Using the proposed methodology, we apply and integrate a variety of dataflow graph optimizations that are important for efficient mapping of the DNN system into a resource constrained implementation that involves cooperating multicore CPUs and field-programmable gate array subsystems. Through experiments, we demonstrate the flexibility and effectiveness with which different design transformations can be applied and integrated across multiple scales of the targeted computing system

    Processamento de dados em Zynq APSoC

    Get PDF
    Mestrado em Engenharia de Computadores e TelemáticaField-Programmable Gate Arrays (FPGAs) were invented by Xilinx in 1985, i.e. less than 30 years ago. The influence of FPGAs on many directions in engineering is growing continuously and rapidly. There are many reasons for such progress and the most important are the inherent reconfigurability of FPGAs and relatively cheap development cost. Recent field-configurable micro-chips combine the capabilities of software and hardware by incorporating multi-core processors and reconfigurable logic enabling the development of highly optimized computational systems for a vast variety of practical applications, including high-performance computing, data, signal and image processing, embedded systems, and many others. In this context, the main goals of the thesis are to study the new micro-chips, namely the Zynq-7000 family and to apply them to two selected case studies: data sort and Hamming weight calculation for long vectors.Field-Programmable Gate Arrays (FPGAs) foram inventadas pela Xilinx em 1985, ou seja, há menos de 30 anos. A influência das FPGAs está a crescer continua e rapidamente em muitos ramos de engenharia. Há varias razões para esta evolução, as mais importantes são a sua capacidade de reconfiguração inerente e os baixos custos de desenvolvimento. Os micro-chips mais recentes baseados em FPGAs combinam capacidades de software e hardware através da incorporação de processadores multi-core e lógica reconfigurável permitindo o desenvolvimento de sistemas computacionais altamente otimizados para uma grande variedade de aplicações práticas, incluindo computação de alto desempenho, processamento de dados, de sinal e imagem, sistemas embutidos, e muitos outros. Neste contexto, este trabalho tem como o objetivo principal estudar estes novos micro-chips, nomeadamente a família Zynq-7000, para encontrar as melhores formas de potenciar as vantagens deste sistema usando casos de estudo como ordenação de dados e cálculo do peso de Hamming para vetores longos

    Estimation par analyse statique de la bande-passante d'accélérateurs en synthèse de haut niveau sur FPGA

    Get PDF
    L’accélération par coprocesseur sur FPGA de portions d’algorithmes logiciels exécutés sur un CPU à usage général est une solution utilisée depuis longtemps dans de nombreux systèmes embarqués lorsque le calcul à effectuer est trop complexe ou la quantité de données à traiter trop grande pour être réalisée par ce processeur trop général pour les contraintes de performance et de puissance données. Avec la fin de la loi de Moore, c’est également une option de plus en plus utilisée dans les centres de données pour pallier à la croissance exponentielle de la consommation de courant des approches CPU et GPGPU. De plus, la réalisation de ces coprocesseurs, bien que restant une tâche plus complexe que la simple programmation d’un processeur, est énormément facilitée par la démocratisation des logiciels de synthèse de haut niveau (HLS), qui permettent la transformation automatisée de code écrit en langages logiciels (généralement un sous-ensemble statique du C/C++) vers des langages de description matérielle synthétisables (VHDL/Verilog). Bien qu’il soit souvent nécessaire d’apporter des modifications au code source pour obtenir de bons résultats, les outils de synthèse de haut niveau comportent généralement un estimateur de performance rapide de la micro-architecture développée, ce qui facilite un flot de développement itératif. Cependant, en pratique, le potentiel de parallélisme et de concurrence des accélérateurs sur FPGA est souvent limité par la bande-passante vers la mémoire contenant les données à traiter ou par la latence des communications entre l’accélérateur et le processeur général qui le contrôle. De plus, l’estimation de cette bande-passante est un problème plus complexe qu’il ne paraît du premier coup d’œil, dépendant notamment de la taille et de la séquentialité des accès, du nombre d’accès simultanés, de la fréquence des différentes composantes du système, etc. Cette bande-passante varie également d’une configuration de contrôleur mémoire à une autre et le tout se complexifie avec les FPGA-SoC (SoC incluant processeurs physiques et partie logique programmable), qui comportent plusieurs chemins des données fixes différents vers leur partie FPGA. Finalement, dans la majorité des cas, la bande-passante atteignable est plus faible que le maximum théorique fourni avec la documentation du fabricant. Cette problématique fait en sorte que bien que les outils existants permettent d’estimer facilement la performance du coprocesseur isolé, cette estimation ne peut être fiable sans considérer comment il est connecté au système mémoire. Les seuls moyens d’avoir des métriques de performance fiables sont donc la simulation ou la synthèse et exécution du système complet. Cependant, alors que l’estimation de performance du coprocesseur isolé ne prend que quelques secondes, la simulation ou la synthèse augmente ce délai à quelques dizaines de minutes, ce qui augmente le temps de mise en marché ou mène à l’utilisation de solutions sous-optimales faute de temps de développement.----------ABSTRACT: FPGA acceleration of portions of code otherwise executed on a general purpose processor is a well known and frequently used solution for speeding up the execution of complex and data-heavy algorithms. This has been the case for around two decades in embedded systems, where power constraints limit the usefulness of inefficient general purpose solutions. However, with the end of Dennard scaling and Moore’s law, FPGA acceleration is also increasingly used in datacenters, where traditional CPU and GPGPU approaches are limited by the always increasing current consumption required by many modern applications such as big data and machine learning. Furthermore, the design of FPGA coprocessors, while still more complex than writing software, is facilitated by the recent democratization of High-Level Synthesis (HLS) tools, which allow the automated translation of high-level software to a hardware description (VHDL/Verilog) equivalent. While it is still generally necessary to modify the high-level code in order to produce good results, HLS tools usually ship with a fast performance estimator of the resulting micro-architecture, allowing for fast iterative development methodologies. However, while FPGAs have great potential for parallelism and concurrence, in practice they are often limited by memory bandwidth and/or by the communications latency between the coprocessor and the general purpose CPU controlling it. In addition, estimating this memory bandwidth is much more complex than it can appear at first glance, since it depends on the size of the data transfer, the order of the accesses, the number of simultaneous accesses to memory, the width of the accessed data, the clock speed of both the FPGA and the memory, etc. This bandwidth also differs from one memory controller configuration to the other, and then everything is made more complex when SoC-FPGAs (SoCs including a hard processor and programmable logic) come into play, since they contain multiple different datapaths between the programmable logic and the hard memory controller. Finally, this bandwidth is almost always different (and smaller) than the maximum theoretical bandwidth given by the manufacturer’s documentation. Thus, while existing HLS tools can easily estimate the coprocessor’s performance if it is isolated from the rest of the system, they do not take into account how this performance is affected by the achievable memory bandwidth. This makes the simulation of the whole system or its synthesis-then-execution the only trustworthy ways to get a good performance estimation. However, while the HLS tool’s performance estimation runtime is a matter of a few seconds, simulation or synthesis takes tens of minutes, which considerably slows down iterative development flows. This increased delay increases time-to-market and can lead to suboptimal solutions due to the extra development time needed

    BIG DATA и анализ высокого уровня : материалы конференции

    Get PDF
    В сборнике опубликованы результаты научных исследований и разработок в области BIG DATA and Advanced Analytics для оптимизации IT-решений и бизнес-решений, а также тематических исследований в области медицины, образования и экологии
    corecore