8,652 research outputs found

    Comprehensive Evaluation of OpenCL-based Convolutional Neural Network Accelerators in Xilinx and Altera FPGAs

    Get PDF
    Deep learning has significantly advanced the state of the art in artificial intelligence, gaining wide popularity from both industry and academia. Special interest is around Convolutional Neural Networks (CNN), which take inspiration from the hierarchical structure of the visual cortex, to form deep layers of convolutional operations, along with fully connected classifiers. Hardware implementations of these deep CNN architectures are challenged with memory bottlenecks that require many convolution and fully-connected layers demanding large amount of communication for parallel computation. Multi-core CPU based solutions have demonstrated their inadequacy for this problem due to the memory wall and low parallelism. Many-core GPU architectures show superior performance but they consume high power and also have memory constraints due to inconsistencies between cache and main memory. FPGA design solutions are also actively being explored, which allow implementing the memory hierarchy using embedded BlockRAM. This boosts the parallel use of shared memory elements between multiple processing units, avoiding data replicability and inconsistencies. This makes FPGAs potentially powerful solutions for real-time classification of CNNs. Both Altera and Xilinx have adopted OpenCL co-design framework from GPU for FPGA designs as a pseudo-automatic development solution. In this paper, a comprehensive evaluation and comparison of Altera and Xilinx OpenCL frameworks for a 5-layer deep CNN is presented. Hardware resources, temporal performance and the OpenCL architecture for CNNs are discussed. Xilinx demonstrates faster synthesis, better FPGA resource utilization and more compact boards. Altera provides multi-platforms tools, mature design community and better execution times

    Comprehensive Evaluation of OpenCL-Based CNN Implementations for FPGAs

    Get PDF
    Deep learning has significantly advanced the state of the art in artificial intelligence, gaining wide popularity from both industry and academia. Special interest is around Convolutional Neural Networks (CNN), which take inspiration from the hierarchical structure of the visual cortex, to form deep layers of convolutional operations, along with fully connected classifiers. Hardware implementations of these deep CNN architectures are challenged with memory bottlenecks that require many convolution and fully-connected layers demanding large amount of communication for parallel computation. Multi-core CPU based solutions have demonstrated their inadequacy for this problem due to the memory wall and low parallelism. Many-core GPU architectures show superior performance but they consume high power and also have memory constraints due to inconsistencies between cache and main memory. OpenCL is commonly used to describe these architectures for their execution on GPGPUs or FPGAs. FPGA design solutions are also actively being explored, which allow implementing the memory hierarchy using embedded parallel BlockRAMs. This boosts the parallel use of shared memory elements between multiple processing units, avoiding data replicability and inconsistencies. This makes FPGAs potentially powerful solutions for real-time classification of CNNs. In this paper both Altera and Xilinx adopted OpenCL co-design frameworks for pseudo-automatic development solutions are evaluated. A comprehensive evaluation and comparison for a 5-layer deep CNN is presented. Hardware resources, temporal performance and the OpenCL architecture for CNNs are discussed. Xilinx demonstrates faster synthesis, better FPGA resource utilization and more compact boards. Altera provides multi-platforms tools, mature design community and better execution times.Ministerio de Economía y Competitividad TEC2016-77785-

    Precise motion descriptors extraction from stereoscopic footage using DaVinci DM6446

    Get PDF
    A novel approach to extract target motion descriptors in multi-camera video surveillance systems is presented. Using two static surveillance cameras with partially overlapped field of view (FOV), control points (unique points from each camera) are identified in regions of interest (ROI) from both cameras footage. The control points within the ROI are matched for correspondence and a meshed Euclidean distance based signature is computed. A depth map is estimated using disparity of each control pair and the ROI is graded into number of regions with the help of relative depth information of the control points. The graded regions of different depths will help calculate accurately the pace of the moving target and also its 3D location. The advantage of estimating a depth map for background static control points over depth map of the target itself is its accuracy and robustness to outliers. The performance of the algorithm is evaluated in the paper using several test sequences. Implementation issues of the algorithm onto the TI DaVinci DM6446 platform are considered in the paper

    Design and application of a multi-modal process tomography system

    Get PDF
    This paper presents a design and application study of an integrated multi-modal system designed to support a range of common modalities: electrical resistance, electrical capacitance and ultrasonic tomography. Such a system is designed for use with complex processes that exhibit behaviour changes over time and space, and thus demand equally diverse sensing modalities. A multi-modal process tomography system able to exploit multiple sensor modes must permit the integration of their data, probably centred upon a composite process model. The paper presents an overview of this approach followed by an overview of the systems engineering and integrated design constraints. These include a range of hardware oriented challenges: the complexity and specificity of the front end electronics for each modality; the need for front end data pre-processing and packing; the need to integrate the data to facilitate data fusion; and finally the features to enable successful fusion and interpretation. A range of software aspects are also reviewed: the need to support differing front-end sensors for each modality in a generic fashion; the need to communicate with front end data pre-processing and packing systems; the need to integrate the data to allow data fusion; and finally to enable successful interpretation. The review of the system concepts is illustrated with an application to the study of a complex multi-component process

    Low-Complexity Hyperspectral Image Compression on a Multi-tiled Architecture

    Get PDF
    The increasing amount of data produced in satellites poses a downlink communication problem due to the limited data rate of the downlink. This bottleneck is solved by introducing more and more processing power on-board to compress data to a satisfiable rate. Currently, this processing power is often provided by custom off the shelf hardware which is needed to run the complex image compression standards. The increase in required processing power often increases the energy required to power the hardware. This in turn pushes algorithm developers to develop lower complexity algorithms which are able to compress the data for the least amount of processing per data element. On the other hand hardware developers are pushed to develop flexible hardware which can be used on multiple missions to cut development cost and can be re-used for different missions. This paper introduces an algorithm which has been developed\ud to compress hyperspectral images at low complexity and describes its mapping to a new hardware platform which has been developed to offer flexibility as well as high performance processing power called the Xentium tile processor
    corecore