23 research outputs found

    Domain-specific and reconfigurable instruction cells based architectures for low-power SoC

    Quantization-Aware NN Layers with High-throughput FPGA Implementation for Edge AI

    Over the past few years, several applications have extensively exploited the advantages of deep learning, in particular convolutional neural networks (CNNs). The intrinsic flexibility of such models has led to their wide adoption in a variety of practical applications, from medical to industrial. In the latter scenario, however, consumer Personal Computer (PC) hardware is not always suitable, given the potentially harsh conditions of the working environment and the strict timing requirements that industrial applications typically have. Therefore, the design of custom FPGA (Field Programmable Gate Array) solutions for network inference is gaining massive attention from researchers and companies alike. In this paper, we propose a family of network architectures composed of three kinds of custom layers working with integer arithmetic at a customizable precision (down to just two bits). Such layers are designed to be effectively trained on classical GPUs (Graphics Processing Units) and then synthesized to FPGA hardware for real-time inference. The idea is to provide a trainable quantization layer, called Requantizer, acting both as a non-linear activation for neurons and as a value rescaler to match the desired bit precision. This way, the training is not only quantization-aware, but also capable of estimating the optimal scaling coefficients to accommodate both the non-linear nature of the activations and the constraints imposed by the limited precision. In the experimental section, we test the performance of this kind of model both on classical PC hardware and in a case-study implementation of a signal peak detection device running on a real FPGA. We employ TensorFlow Lite for training and comparison, and use Xilinx FPGAs and Vivado for synthesis and implementation. The results show an accuracy of the quantized networks close to that of the floating point version, without the need for representative data for calibration as in other approaches, and performance that is better than dedicated peak detection algorithms. The FPGA implementation is able to run in real time at a rate of four gigapixels per second with moderate hardware resources, while achieving a sustained efficiency of 0.5 TOPS/W (tera operations per second per watt), in line with custom integrated hardware accelerators.

    SIMD pipelined processor implemented on a FPGA

    The goal of this thesis was to create a processor using VHDL that could be used for educational purposes as well as a stepping stone in creating a reconfigurable system for digital signal processing or image processing applications. To do this, a subset of MIPS instructions was chosen to demonstrate functionality within a five-stage pipeline (instruction fetch, instruction decode, execution, memory, and write back) processor in simulation and in synthesis. A hazard controller was implemented to handle data forwarding and stalling. The basic MIPS architecture was extended by adding single-cycle multiplication functionality and single-cycle SIMD instructions. The architecture contains parameters for easy modification of SIMD units depending on the needs of the processor. The SIMD architecture was designed with distributed memory so that every memory unit receives the same address. This simplifies the address logic so that the processor does not have to use a complex addressing mode. The memory can be pictured as being accessed by rows and columns. The SIMD instructions were chosen to perform binary operations for implementing future morphological operations, and to use the multiply and add operations for implementing MACs that perform convolution and filtering operations in future image processing applications. The board used to verify the processor was a Xilinx University Program (XUP) board that contains a Xilinx Virtex II Pro XC2VP30 FPGA, package FF896. The maximum number of units that can be instantiated in the FPGA on the XUP board is eight, which would use the entire FPGA slice area. This allows the processor to complete eight sets of 32-bit data operations per cycle when the SIMD pipeline is full. The design was shown to operate at a maximum speed of 100 MHz and to utilize all the area of the FPGA. The processor was verified in both simulation and synthesis. The new soft-core 32-bit SIMD processor extends existing soft-core processors in that it provides a reconfigurable SIMD pipeline, allowing it to operate on multiple inputs concurrently, with 32-bit operands and single-cycle throughput.
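The abstract mentions SIMD units that all receive the same memory address and multiply-add instructions intended for MAC-based filtering. The sketch below models that behaviour in software; it is an illustration under assumptions, not the thesis design. The names `simd_mac` and `filter_lanes` are hypothetical, each lane's memory is modelled as a Python list, and the filter is written as a sliding dot product (reverse the kernel for a true convolution).

```python
MASK32 = 0xFFFFFFFF  # results wrap to 32 bits, matching the 32-bit operands

def simd_mac(acc, a, b):
    """One SIMD multiply-accumulate step: every lane computes acc + a * b."""
    return [(x + y * z) & MASK32 for x, y, z in zip(acc, a, b)]

def filter_lanes(signals, kernel):
    """Apply the same kernel to several signals, one signal per SIMD lane."""
    lanes = len(signals)
    n_out = len(signals[0]) - len(kernel) + 1
    outputs = []
    for i in range(n_out):
        acc = [0] * lanes
        for k, coeff in enumerate(kernel):
            # Every lane reads the same address (i + k) from its own memory.
            operands = [s[i + k] for s in signals]
            acc = simd_mac(acc, operands, [coeff] * lanes)
        outputs.append(acc)
    return outputs

print(filter_lanes([[1, 2, 3, 4], [0, 1, 0, 1]], kernel=[1, 1]))
```

Because all lanes share one address, the only per-lane state is the data itself, which mirrors the simplified address logic the abstract describes.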

    Terabit Burst Switching Final Report

    This is the final report for Washington University's Terabit Burst Switching Project, supported by DARPA and Rome Air Force Laboratory. The primary objective of the project has been to demonstrate the feasibility of Burst Switching, a new data communication service, which seeks to more effectively exploit the large bandwidths becoming available in WDM transmission systems. Burst switching systems dynamically assign data bursts to channels in optical datalinks, using routing information carried in parallel control channels.

    Reconfigurable Instruction Cell Architecture Reconfiguration and Interconnects

    Efficient machine learning: models and accelerations

    One of the key enablers of the recent unprecedented success of machine learning is the adoption of very large models. Modern machine learning models, such as deep neural networks, typically consist of multiple cascaded layers and have millions to hundreds of millions of parameters (i.e., weights). Larger-scale models tend to enable the extraction of more complex high-level features and therefore lead to a significant improvement in overall accuracy. On the other hand, the layered deep structure and large model sizes also demand greater computational capability and more memory. In order to achieve higher scalability, performance, and energy efficiency for deep learning systems, two orthogonal research and development trends have attracted enormous interest. The first trend is acceleration, while the second is model compression. The underlying goal of both trends is to keep model quality high so that predictions remain accurate. In this thesis, we address these two problems and utilize different computing paradigms to solve real-life deep learning problems. To explore these two domains, this thesis first presents the cogent confabulation network for the sentence completion problem. We use the Chinese language as a case study to describe our exploration of cogent confabulation based text recognition models. The exploration and optimization of the cogent confabulation based models have been conducted through various comparisons. The optimized network offered better accuracy for sentence completion. To accelerate the sentence completion problem in a multi-processing system, we propose a parallel framework for the confabulation recall algorithm. The parallel implementation reduces runtime, improves the recall accuracy by breaking the fixed evaluation order and introducing more generalization, and maintains balanced progress in status updates among all neurons. A lexicon scheduling algorithm is presented to further improve the model performance. As deep neural networks have proven effective in many real-life applications and are being deployed on low-power devices, we then investigate the acceleration of neural network inference using a hardware-friendly computing paradigm, stochastic computing. It is an approximate computing paradigm that requires a small hardware footprint and achieves high energy efficiency. Applying stochastic computing to deep convolutional neural networks, we design the functional hardware blocks and optimize them jointly to minimize the accuracy loss due to the approximation. The synthesis results show that the proposed design achieves remarkably low hardware cost and power/energy consumption. Modern neural networks usually involve a huge number of parameters that cannot fit into embedded devices. Compression of deep learning models, together with acceleration, therefore attracts our attention. We introduce structured-matrix based neural networks to address this problem. The circulant matrix is one such structured matrix: it can be represented using a single vector, so the matrix is compressed. We further investigate a more flexible structure based on the circulant matrix, called the block-circulant matrix. It partitions a matrix into several smaller blocks and makes each submatrix circulant. The compression ratio is controllable. With the help of Fourier Transform based equivalent computation, the inference of the deep neural network can be accelerated energy-efficiently on FPGAs. We also offer an optimization of the training algorithm for block-circulant matrix based neural networks to obtain high accuracy after compression.
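The abstract relies on two properties of a circulant matrix: it is fully described by one vector, and its matrix-vector product has an equivalent Fourier-domain form. A minimal sketch of both follows, using the standard convolution theorem. The function names are hypothetical, and a naive O(n^2) DFT is used only to keep the example dependency-free; a practical implementation would use an FFT.

```python
import cmath

def dft(x, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out

def circulant_matvec_direct(c, x):
    """Multiply the circulant matrix whose first column is c by x.

    Entry (i, j) of the matrix is c[(i - j) mod n], so only c is stored.
    """
    n = len(c)
    return [sum(c[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]

def circulant_matvec_fourier(c, x):
    """Same product via the convolution theorem: C x = IDFT(DFT(c) * DFT(x))."""
    fc, fx = dft(c), dft(x)
    y = dft([a * b for a, b in zip(fc, fx)], inverse=True)
    return [v.real for v in y]

c = [1.0, 2.0, 3.0, 4.0]
x = [0.5, -1.0, 2.0, 0.0]
print(circulant_matvec_direct(c, x))
print(circulant_matvec_fourier(c, x))
```

A block-circulant layer would apply the same routine to each block and sum the partial results per output block, so the block size sets the compression ratio: an n-by-n block stores n values instead of n squared.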

    Spectrum Optimisation in Wireless Communication Systems: Technology Evaluation, System Design and Practical Implementation

    Two key technology enablers for next generation networks are examined in this thesis, namely Cognitive Radio (CR) and Spectrally Efficient Frequency Division Multiplexing (SEFDM). The first part proposes the use of traffic prediction in CR systems to improve the Quality of Service (QoS) for CR users. A framework is presented which allows CR users to capture a frequency slot in an idle licensed channel occupied by primary users. This is achieved by using CR to sense and select target spectrum bands, combined with traffic prediction to determine the optimum channel-sensing order. The latter part of this thesis considers the design, practical implementation and performance evaluation of SEFDM. The key challenge that arises in SEFDM is the self-created interference, which complicates the design of receiver architectures. Previous work has focused on the development of sophisticated detection algorithms; however, these suffer from impractical computational complexity. Consequently, the aim of this work is two-fold: first, to reduce the complexity of existing algorithms to make them better suited for application in the real world; second, to develop hardware prototypes to assess the feasibility of employing SEFDM in practical systems. The impact of oversampling and fixed-point effects on the performance of SEFDM is initially determined, followed by the design and implementation of linear detection techniques using Field Programmable Gate Arrays (FPGAs). The performance of these FPGA based linear receivers is evaluated in terms of throughput, resource utilisation and Bit Error Rate (BER). Finally, variants of the Sphere Decoding (SD) algorithm are investigated to improve the error performance of SEFDM systems with a targeted reduction in complexity. The Fixed SD (FSD) algorithm is implemented on a Digital Signal Processor (DSP) to measure its computational complexity. Modified sorting and decomposition strategies are then applied to this FSD algorithm, offering trade-offs between execution speed and BER.

    Using a parallel programming model for architecture synthesis

    The problem of automatically generating hardware modules from high-level application representations has been at the forefront of EDA research during the last few years. In this dissertation we introduce a methodology to automatically synthesize hardware accelerators from OpenCL applications. OpenCL is a recent industry-supported standard for writing programs that execute on multicore platforms and accelerators such as GPUs. Our methodology maps OpenCL kernels into hardware accelerators based on architectural templates that explicitly decouple computation from memory communication whenever this is possible. The templates can be tuned to provide a wide repertoire of accelerators that meet user performance requirements and FPGA device characteristics. Furthermore, a set of high- and low-level compiler optimizations is applied to generate optimized accelerators. Our experimental evaluation shows that the generated accelerators are tuned efficiently to match each application's memory access pattern and computational complexity and to achieve user performance requirements. An important objective of our tool is to expand the FPGA development user base to software engineers, thereby expanding the scope of FPGAs beyond the realm of hardware design.

    The Fifth NASA Symposium on VLSI Design

    The fifth annual NASA Symposium on VLSI Design had 13 sessions including Radiation Effects, Architectures, Mixed Signal, Design Techniques, Fault Testing, Synthesis, Signal Processing, and other Featured Presentations. The symposium provides insights into developments in VLSI and digital systems which can be used to increase data systems performance. The presentations share insights into next generation advances that will serve as a basis for future VLSI design

    Belle II Technical Design Report

    The Belle detector at the KEKB electron-positron collider has collected almost 1 billion Υ(4S) events in its decade of operation. Super-KEKB, an upgrade of KEKB, is under construction to increase the luminosity by two orders of magnitude during a three-year shutdown, with an ultimate goal of 8×10^35 cm^-2 s^-1 luminosity. To exploit the increased luminosity, an upgrade of the Belle detector has been proposed. A new international collaboration, Belle-II, is being formed. The Technical Design Report presents physics motivation, basic methods of the accelerator upgrade, as well as key improvements of the detector. Comment: Edited by: Z. Doležal and S. Un