4,220 research outputs found

    Maximizing CNN Accelerator Efficiency Through Resource Partitioning

    Full text link
    Convolutional neural networks (CNNs) are revolutionizing machine learning, but they present significant computational challenges. Recently, many FPGA-based accelerators have been proposed to improve the performance and efficiency of CNNs. Current approaches construct a single processor that computes the CNN layers one at a time; the processor is optimized to maximize the throughput at which the collection of layers is computed. However, this approach leads to inefficient designs because the same processor structure is used to compute CNN layers of radically varying dimensions. We present a new CNN accelerator paradigm and an accompanying automated design methodology that partitions the available FPGA resources into multiple processors, each of which is tailored for a different subset of the CNN convolutional layers. Using the same FPGA resources as a single large processor, multiple smaller specialized processors increase computational efficiency and lead to a higher overall throughput. Our design methodology achieves 3.8x higher throughput than the state-of-the-art approach on evaluating the popular AlexNet CNN on a Xilinx Virtex-7 FPGA. For the more recent SqueezeNet and GoogLeNet, the speedups are 2.2x and 2.0x

    Comprehensive Evaluation of OpenCL-based Convolutional Neural Network Accelerators in Xilinx and Altera FPGAs

    Get PDF
    Deep learning has significantly advanced the state of the art in artificial intelligence, gaining wide popularity from both industry and academia. Special interest is around Convolutional Neural Networks (CNN), which take inspiration from the hierarchical structure of the visual cortex, to form deep layers of convolutional operations, along with fully connected classifiers. Hardware implementations of these deep CNN architectures are challenged with memory bottlenecks that require many convolution and fully-connected layers demanding large amount of communication for parallel computation. Multi-core CPU based solutions have demonstrated their inadequacy for this problem due to the memory wall and low parallelism. Many-core GPU architectures show superior performance but they consume high power and also have memory constraints due to inconsistencies between cache and main memory. FPGA design solutions are also actively being explored, which allow implementing the memory hierarchy using embedded BlockRAM. This boosts the parallel use of shared memory elements between multiple processing units, avoiding data replicability and inconsistencies. This makes FPGAs potentially powerful solutions for real-time classification of CNNs. Both Altera and Xilinx have adopted OpenCL co-design framework from GPU for FPGA designs as a pseudo-automatic development solution. In this paper, a comprehensive evaluation and comparison of Altera and Xilinx OpenCL frameworks for a 5-layer deep CNN is presented. Hardware resources, temporal performance and the OpenCL architecture for CNNs are discussed. Xilinx demonstrates faster synthesis, better FPGA resource utilization and more compact boards. Altera provides multi-platforms tools, mature design community and better execution times

    Configurable 3D-integrated focal-plane sensor-processor array architecture

    Get PDF
    A mixed-signal Cellular Visual Microprocessor architecture with digital processors is described. An ASIC implementation is also demonstrated. The architecture is composed of a regular sensor readout circuit array, prepared for 3D face-to-face type integration, and one or several cascaded array of mainly identical (SIMD) processing elements. The individual array elements derived from the same general HDL description and could be of different in size, aspect ratio, and computing resources

    Photonic integration enabling new multiplexing concepts in optical board-to-board and rack-to-rack interconnects

    Get PDF
    New broadband applications are causing the datacenters to proliferate, raising the bar for higher interconnection speeds. So far, optical board-to-board and rack-to-rack interconnects relied primarily on low-cost commodity optical components assembled in a single package. Although this concept proved successful in the first generations of optical-interconnect modules, scalability is a daunting issue as signaling rates extend beyond 25 Gb/s. In this paper we present our work towards the development of two technology platforms for migration beyond Infiniband enhanced data rate (EDR), introducing new concepts in board-to-board and rack-to-rack interconnects. The first platform is developed in the framework of MIRAGE European project and relies on proven VCSEL technology, exploiting the inherent cost, yield, reliability and power consumption advantages of VCSELs. Wavelength multiplexing, PAM-4 modulation and multi-core fiber (MCF) multiplexing are introduced by combining VCSELs with integrated Si and glass photonics as well as BiCMOS electronics. An in-plane MCF-to-SOI interface is demonstrated, allowing coupling from the MCF cores to 340x400 nm Si waveguides. Development of a low-power VCSEL driver with integrated feed-forward equalizer is reported, allowing PAM-4 modulation of a bandwidth-limited VCSEL beyond 25 Gbaud. The second platform, developed within the frames of the European project PHOXTROT, considers the use of modulation formats of increased complexity in the context of optical interconnects. Powered by the evolution of DSP technology and towards an integration path between inter and intra datacenter traffic, this platform investigates optical interconnection system concepts capable to support 16QAM 40GBd data traffic, exploiting the advancements of silicon and polymer technologies

    Efficient Neural Network Implementations on Parallel Embedded Platforms Applied to Real-Time Torque-Vectoring Optimization Using Predictions for Multi-Motor Electric Vehicles

    Get PDF
    The combination of machine learning and heterogeneous embedded platforms enables new potential for developing sophisticated control concepts which are applicable to the field of vehicle dynamics and ADAS. This interdisciplinary work provides enabler solutions -ultimately implementing fast predictions using neural networks (NNs) on field programmable gate arrays (FPGAs) and graphical processing units (GPUs)- while applying them to a challenging application: Torque Vectoring on a multi-electric-motor vehicle for enhanced vehicle dynamics. The foundation motivating this work is provided by discussing multiple domains of the technological context as well as the constraints related to the automotive field, which contrast with the attractiveness of exploiting the capabilities of new embedded platforms to apply advanced control algorithms for complex control problems. In this particular case we target enhanced vehicle dynamics on a multi-motor electric vehicle benefiting from the greater degrees of freedom and controllability offered by such powertrains. Considering the constraints of the application and the implications of the selected multivariable optimization challenge, we propose a NN to provide batch predictions for real-time optimization. This leads to the major contribution of this work: efficient NN implementations on two intrinsically parallel embedded platforms, a GPU and a FPGA, following an analysis of theoretical and practical implications of their different operating paradigms, in order to efficiently harness their computing potential while gaining insight into their peculiarities. The achieved results exceed the expectations and additionally provide a representative illustration of the strengths and weaknesses of each kind of platform. Consequently, having shown the applicability of the proposed solutions, this work contributes valuable enablers also for further developments following similar fundamental principles.Some of the results presented in this work are related to activities within the 3Ccar project, which has received funding from ECSEL Joint Undertaking under grant agreement No. 662192. This Joint Undertaking received support from the European Union’s Horizon 2020 research and innovation programme and Germany, Austria, Czech Republic, Romania, Belgium, United Kingdom, France, Netherlands, Latvia, Finland, Spain, Italy, Lithuania. This work was also partly supported by the project ENABLES3, which received funding from ECSEL Joint Undertaking under grant agreement No. 692455-2
    • 

    corecore