208 research outputs found

    Optimizing the FPGA memory design for a Sobel edge detector

    Get PDF
    This paper explores different memory systems by investigating the trade-offs involved with choosing one memory system over another on an FPGA. As an example, we use a Sobel edge detector to look at the trade-offs for different memory components. We demonstrate how each type of memory affects I/O performance and area. By exploiting these trade-offs in performance and area a designer should be able to find an optimum on-chip memory system for a given application

    Optimizing the FPGA memory design for a Sobel edge detector

    Get PDF
    This research explored different memory systems on FPGA chips in order to show the various trade-offs involved with choosing one memory system over another. We explored the different memory components that are found on FPGA chips using the example of a Sobel edge detector. We demonstrated how the different FPGA chip’s memories affected I/O performance and area. By exploiting the trade-offs between these a designer should be able to find an optimal on-chip memory system for a given application. Given further study, we believe we can develop application-specific memory templates that can be used with a hardware compiler to generate optimal on-chip memory system

    Data reuse buffer synthesis using the polyhedral model

    Get PDF
    Current high-level synthesis (HLS) tools for the automatic design of computing hardware perform excellently for the synthesis of computation kernels, but they often do not optimize memory bandwidth. As accessing memory is a bottleneck in many algorithms, the performance of the generated circuit could benefit substantially from memory access optimization. In this paper, we present a method and a tool to automate the optimization of memory accesses to array data in HLS by introducing local memory tailored perfectly to store only the data that are used repeatedly. Our method detects data reuse in the source code of the algorithm to be implemented in hardware, selects and parameterizes data reuse buffers, and generates a register transfer level design of the data buffers and a matching loop controller that coordinates reuse buffers and datapath operations. Throughout this paper, the polyhedral representation is used extensively as it proves to be well suited for calculations on loop nests and data accesses. As a consequence, this paper is limited to affine programs which can be represented in this model. Experiments show that our method outperforms state-of-the-art academic and commercial HLS tools

    Area Optimized FPGA-Based Implementation of The Sobel Compass Edge Detector

    Get PDF

    OPTIMIZATION OF FPGA-BASED PROCESSOR ARCHITECTURE FOR SOBEL EDGE DETECTION OPERATOR

    Get PDF
    This dissertation introduces an optimized processor architecture for Sobel edge detection operator on field programmable gate arrays (FPGAs). The processor is optimized by the use of several optimization techniques that aim to increase the processor throughput and reduce the processor logic utilization and memory usage. FPGAs offer high levels of parallelism which is exploited by the processor to implement the parallel process of edge detection in order to increase the processor throughput and reduce the logic utilization. To achieve this, the proposed processor consists of several Sobel instances that are able to produce multiple output pixels in parallel. This parallelism enables data reuse within the processor block. Moreover, the processor gains performance with a factor equal to the number of instances contained in the processor block. The processor that consists of one row of Sobel instances exploits data reuse within one image line in the calculations of the horizontal gradient. Data reuse within one and multiple image lines is enabled by using a processor with multiple rows of Sobel instances which allow the reuse of both the horizontal and vertical gradients. By the application of the optimization techniques, the proposed Sobel processor is able to meet real-time performance constraints due to its high throughput even with a considerably low clock frequency. In addition, logic utilization of the processor is low compared to other Sobel processors when implemented on ALTERA Cyclone II DE2-70

    Software Defined Multi-Spectral Imaging for Arctic Sensor Networks

    Get PDF
    Availability of off-the-shelf infrared sensors combined with high definition visible cameras has made possible the construction of a Software Defined Multi-Spectral Imager (SDMSI) combining long-wave, near-infrared and visible imaging. The SDMSI requires a real-time embedded processor to fuse images and to create real-time depth maps for opportunistic uplink in sensor networks. Researchers at Embry Riddle Aeronautical University working with University of Alaska Anchorage at the Arctic Domain Awareness Center and the University of Colorado Boulder have built several versions of a low-cost drop-in-place SDMSI to test alternatives for power efficient image fusion. The SDMSI is intended for use in field applications including marine security, search and rescue operations and environmental surveys in the Arctic region. Based on Arctic marine sensor network mission goals, the team has designed the SDMSI to include features to rank images based on saliency and to provide on camera fusion and depth mapping. A major challenge has been the design of the camera computing system to operate within a 10 to 20 Watt power budget. This paper presents a power analysis of three options: 1) multi-core, 2) field programmable gate array with multi-core, and 3) graphics processing units with multi-core. For each test, power consumed for common fusion workloads has been measured at a range of frame rates and resolutions. Detailed analyses from our power efficiency comparison for workloads specific to stereo depth mapping and sensor fusion are summarized. Preliminary mission feasibility results from testing with off-the-shelf long-wave infrared and visible cameras in Alaska and Arizona are also summarized to demonstrate the value of the SDMSI for applications such as ice tracking, ocean color, soil moisture, animal and marine vessel detection and tracking. The goal is to select the most power efficient solution for the SDMSI for use on UAVs (Unoccupied Aerial Vehicles) and other drop-in-place installations in the Arctic. The prototype selected will be field tested in Alaska in the summer of 2016

    Towards a tighter integration of generated and custom-made hardware

    Get PDF
    Most of today's high-level synthesis tools offer a fixed set of interfaces to communicate with the outer world. A direct integration of custom IF in the datapath would often be more beneficial than an integration using such communication interfaces. If a certain interface protocol is not offered by the tool, either translation blocks (wrappers) are needed or the code should be written at a lower level. The former solution may hurt the performance, while the latter one is often impossible using an untimed high-level description. In this paper interface protocols or sets of JP core accesses are first described at a low level as sets of operations with scheduling information (macros). During the synthesis process, corresponding function calls are mapped to these macros. This facilitates the integration of custom-made hardware and hardware generated by high-level synthesis tools

    FPGA Accelerators on Heterogeneous Systems: An Approach Using High Level Synthesis

    Get PDF
    La evolución de las FPGAs como dispositivos para el procesamiento con alta eficiencia energética y baja latencia de control, comparada con dispositivos como las CPUs y las GPUs, las han hecho atractivas en el ámbito de la computación de alto rendimiento (HPC).A pesar de las inumerables ventajas de las FPGAs, su inclusión en HPC presenta varios retos. El primero, la complejidad que supone la programación de las FPGAs comparada con dispositivos como las CPUs y las GPUs. Segundo, el tiempo de desarrollo es alto debido al proceso de síntesis del hardware. Y tercero, trabajar con más arquitecturas en HPC requiere el manejo y la sintonización de los detalles de cada dispositivo, lo que añade complejidad.Esta tesis aborda estos 3 problemas en diferentes niveles con el objetivo de mejorar y facilitar la adopción de las FPGAs usando la síntesis de alto nivel(HLS) en sistemas HPC.En un nivel próximo al hardware, en esta tesis se desarrolla un modelo analítico para las aplicaciones limitadas en memoria, que es una situación común en aplicaciones de HPC. El modelo, desarrollado para kernels programados usando HLS, puede predecir el tiempo de ejecución con alta precisión y buena adaptabilidad ante cambios en la tecnología de la memoria, como las DDR4 y HBM2, y en las variaciones en la frecuencia del kernel. Esta solución puede aumentar potencialmente la productividad de las personas que programan, reduciendo el tiempo de desarrollo y optimización de las aplicaciones.Entender los detalles de bajo nivel puede ser complejo para las programadoras promedio, y el desempeño de las aplicaciones para FPGA aún requiere un alto nivel en las habilidades de programación. Por ello, nuestra segunda propuesta está enfocada en la extensión de las bibliotecas con una propuesta para cómputo en visión artificial que sea portable entre diferentes fabricantes de FPGAs. La biblioteca se ha diseñado basada en templates, lo que permite una biblioteca que da flexibilidad a la generación del hardware y oculta decisiones de diseño críticas como la comunicación entre nodos, el modelo de concurrencia, y la integración de las aplicaciones en el sistema heterogéneo para facilitar el desarrollo de grafos de visión artificial que pueden ser complejos.Finalmente, en el runtime del host del sistema heterogéneo, hemos integrado la FPGA para usarla de forma trasparente como un dispositivo acelerador para la co-ejecución en sistemas heterogéneos. Hemos hecho una serie propuestas de altonivel de abstracción que abarca los mecanismos de sincronización y políticas de balanceo en un sistema altamente heterogéneo compuesto por una CPU, una GPU y una FPGA. Se presentan los principales retos que han inspirado esta investigación y los beneficios de la inclusión de una FPGA en rendimiento y energía.En conclusión, esta tesis contribuye a la adopción de las FPGAs para entornos HPC, aportando soluciones que ayudan a reducir el tiempo de desarrollo y mejoran el desempeño y la eficiencia energética del sistema.---------------------------------------------The emergence of FPGAs in the High-Performance Computing domain is arising thanks to their promise of better energy efficiency and low control latency, compared with other devices such as CPUs or GPUs.Albeit these benefits, their complete inclusion into HPC systems still faces several challenges. First, FPGA complexity means its programming more difficult compared to devices such as CPU and GPU. Second, the development time is longer due to the required synthesis effort. And third, working with multiple devices increments the details that should be managed and increase hardware complexity.This thesis tackles these 3 problems at different stack levels to improve and to make easier the adoption of FPGAs using High-Level Synthesis on HPC systems. At a close to the hardware level, this thesis contributes with a new analytical model for memory-bound applications, an usual situation for HPC applications. The model for HLS kernels can anticipate application performance before place and route, reducing the design development time. Our results show a high precision and adaptable model for external memory technologies such as DDR4 and HBM2, and kernel frequency changes. This solution potentially increases productivity, reducing application development time.Understanding low-level implementation details is difficult for average programmers, and the development of FPGA applications still requires high proficiency program- ming skills. For this reason, the second proposal is focused on the extension of a computer vision library to be portable among two of the main FPGA vendors. The template-based library allows hardware flexibility and hides design decisions such as the communication among nodes, the concurrency programming model, and the application’s integration in the heterogeneous system, to develop complex vision graphs easily.Finally, we have transparently integrated the FPGA in a high level framework for co-execution with other devices. We propose a set of high level abstractions covering synchronization mechanism and load balancing policies in a highly heterogeneous system with CPU, GPU, and FPGA devices. We present the main challenges that inspired this research and the benefits of the FPGA use demonstrating performance and energy improvements.<br /

    Computer Vision System-On-Chip Designs for Intelligent Vehicles

    Get PDF
    Intelligent vehicle technologies are growing rapidly that can enhance road safety, improve transport efficiency, and aid driver operations through sensors and intelligence. Advanced driver assistance system (ADAS) is a common platform of intelligent vehicle technologies. Many sensors like LiDAR, radar, cameras have been deployed on intelligent vehicles. Among these sensors, optical cameras are most widely used due to their low costs and easy installation. However, most computer vision algorithms are complicated and computationally slow, making them difficult to be deployed on power constraint systems. This dissertation investigates several mainstream ADAS applications, and proposes corresponding efficient digital circuits implementations for these applications. This dissertation presents three ways of software / hardware algorithm division for three ADAS applications: lane detection, traffic sign classification, and traffic light detection. Using FPGA to offload critical parts of the algorithm, the entire computer vision system is able to run in real time while maintaining a low power consumption and a high detection rate. Catching up with the advent of deep learning in the field of computer vision, we also present two deep learning based hardware implementations on application specific integrated circuits (ASIC) to achieve even lower power consumption and higher accuracy. The real time lane detection system is implemented on Xilinx Zynq platform, which has a dual core ARM processor and FPGA fabric. The Xilinx Zynq platform integrates the software programmability of an ARM processor with the hardware programmability of an FPGA. For the lane detection task, the FPGA handles the majority of the task: region-of-interest extraction, edge detection, image binarization, and hough transform. After then, the ARM processor takes in hough transform results and highlights lanes using the hough peaks algorithm. The entire system is able to process 1080P video stream at a constant speed of 69.4 frames per second, realizing real time capability. An efficient system-on-chip (SOC) design which classifies up to 48 traffic signs in real time is presented in this dissertation. The traditional histogram of oriented gradients (HoG) and support vector machine (SVM) are proven to be very effective on traffic sign classification with an average accuracy rate of 93.77%. For traffic sign classification, the biggest challenge comes from the low execution efficiency of the HoG on embedded processors. By dividing the HoG algorithm into three fully pipelined stages, as well as leveraging extra on-chip memory to store intermediate results, we successfully achieved a throughput of 115.7 frames per second at 1080P resolution. The proposed generic HoG hardware implementation could also be used as an individual IP core by other computer vision systems. A real time traffic signal detection system is implemented to present an efficient hardware implementation of the traditional grass-fire blob detection. The traditional grass-fire blob detection method iterates the input image multiple times to calculate connected blobs. In digital circuits, five extra on-chip block memories are utilized to save intermediate results. By using additional memories, all connected blob information could be obtained through one-pass image traverse. The proposed hardware friendly blob detection can run at 72.4 frames per second with 1080P video input. Applying HoG + SVM as feature extractor and classifier, 92.11% recall rate and 99.29% precision rate are obtained on red lights, and 94.44% recall rate and 98.27% precision rate on green lights. Nowadays, convolutional neural network (CNN) is revolutionizing computer vision due to learnable layer by layer feature extraction. However, when coming into inference, CNNs are usually slow to train and slow to execute. In this dissertation, we studied the implementation of principal component analysis based network (PCANet), which strikes a balance between algorithm robustness and computational complexity. Compared to a regular CNN, the PCANet only needs one iteration training, and typically at most has a few tens convolutions on a single layer. Compared to hand-crafted features extraction methods, the PCANet algorithm well reflects the variance in the training dataset and can better adapt to difficult conditions. The PCANet algorithm achieves accuracy rates of 96.8% and 93.1% on road marking detection and traffic light detection, respectively. Implementing in Synopsys 32nm process technology, the proposed chip can classify 724,743 32-by-32 image candidates in one second, with only 0.5 watt power consumption. In this dissertation, binary neural network (BNN) is adopted as a potential detector for intelligent vehicles. The BNN constrains all activations and weights to be +1 or -1. Compared to a CNN with the same network configuration, the BNN achieves 50 times better resource usage with only 1% - 2% accuracy loss. Taking car detection and pedestrian detection as examples, the BNN achieves an average accuracy rate of over 95%. Furthermore, a BNN accelerator implemented in Synopsys 32nm process technology is presented in our work. The elastic architecture of the BNN accelerator makes it able to process any number of convolutional layers with high throughput. The BNN accelerator only consumes 0.6 watt and doesn\u27t rely on external memory for storage
    corecore