247 research outputs found

    An Application-Specific VLIW Processor with Vector Instruction Set for CNN Acceleration

    In recent years, neural networks have surpassed classical algorithms in areas such as object recognition, e.g. in the well-known ImageNet challenge. As a result, great effort is being put into developing fast and efficient accelerators, especially for Convolutional Neural Networks (CNNs). In this work we present ConvAix, a fully C-programmable processor, which, contrary to many existing architectures, does not rely on a hard-wired array of multiply-and-accumulate (MAC) units. Instead, it maps computations onto independent vector lanes, making use of a carefully designed vector instruction set. The presented processor is targeted towards latency-sensitive applications and is capable of executing up to 192 MAC operations per cycle. ConvAix operates at a target clock frequency of 400 MHz in 28 nm CMOS, thereby offering state-of-the-art performance with proper flexibility within its target domain. Simulation results for several 2D convolutional layers from well-known CNNs (AlexNet, VGG-16) show an average ALU utilization of 72.5% using vector instructions with 16-bit fixed-point arithmetic. Compared to other well-known designs which are less flexible, ConvAix offers competitive energy efficiency of up to 497 GOP/s/W while even surpassing them in terms of area efficiency and processing speed.
    Comment: Accepted for publication in the proceedings of the 2019 IEEE International Symposium on Circuits and Systems (ISCAS)
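The figures quoted in the ConvAix abstract can be combined into a rough throughput estimate. The sketch below is only a back-of-the-envelope calculation: it assumes that one MAC counts as two operations (one multiply plus one add) and that sustained throughput scales linearly with the reported average ALU utilization. Neither assumption is stated in the abstract, and the function names are hypothetical.

```python
# Rough throughput estimate from the numbers quoted in the abstract.
MACS_PER_CYCLE = 192      # peak MAC operations per cycle (from the abstract)
CLOCK_HZ = 400e6          # target clock frequency, 400 MHz (from the abstract)
ALU_UTILIZATION = 0.725   # average ALU utilization (from the abstract)
OPS_PER_MAC = 2           # ASSUMPTION: 1 MAC = 1 multiply + 1 add


def peak_gops(macs_per_cycle, clock_hz, ops_per_mac=OPS_PER_MAC):
    """Peak throughput in GOP/s if every MAC slot is busy every cycle."""
    return macs_per_cycle * clock_hz * ops_per_mac / 1e9


def sustained_gops(peak, utilization):
    """ASSUMPTION: sustained throughput = peak x average utilization."""
    return peak * utilization


peak = peak_gops(MACS_PER_CYCLE, CLOCK_HZ)
sustained = sustained_gops(peak, ALU_UTILIZATION)
print(f"peak: {peak:.1f} GOP/s, sustained estimate: {sustained:.2f} GOP/s")
```

Under these assumptions the peak works out to about 153.6 GOP/s and the utilization-weighted estimate to about 111.4 GOP/s. These are derived values, not results reported by the authors.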

    Enabling multi-segment 5G service provisioning and maintenance through network slicing

    This is a post-peer-review, pre-copyedit version of an article published in Journal of Network and Systems Management. The final authenticated version is available online at: http://dx.doi.org/10.1007/s10922-019-09509-9
    The current deployment of 5G networks, intended to support the highly demanding service types defined for 5G, has created the need for new techniques to adapt legacy networks to such requirements. Network slicing, in turn, enables sharing the same underlying physical infrastructure among services with different requirements, while providing a level of isolation between them to guarantee their proper functionality. In this work, we analyse, from an architectural point of view, the coordination required to provision 5G services over multiple network segments/domains by means of network slicing, also considering the use of sensors and actuators to maintain slice performance during its lifetime. We set up an experimental multi-segment testbed to demonstrate end-to-end service provisioning and its guarantee in terms of specific QoS parameters, such as latency, throughput and Virtual Network Function (VNF) CPU/RAM consumption. The results demonstrate the workflow between different network components to coordinate the deployment of slices, and provide a set of examples of slice maintenance through service monitoring and policy-based actuations.
    Peer Reviewed. Postprint (author's final draft)

    Use of a Parallel Programming Model for Architecture Synthesis

    The problem of automatically generating hardware modules from high-level application representations has been at the forefront of EDA research during the last few years. In this dissertation we introduce a methodology to automatically synthesize hardware accelerators from OpenCL applications. OpenCL is a recent industry-supported standard for writing programs that execute on multicore platforms and accelerators such as GPUs. Our methodology maps OpenCL kernels into hardware accelerators based on architectural templates that explicitly decouple computation from memory communication whenever this is possible. The templates can be tuned to provide a wide repertoire of accelerators that meet user performance requirements and FPGA device characteristics. Furthermore, a set of high- and low-level compiler optimizations is applied to generate optimized accelerators. Our experimental evaluation shows that the generated accelerators are tuned efficiently to match each application's memory access pattern and computational complexity and to achieve user performance requirements. An important objective of our tool is to expand the FPGA development user base to software engineers, thereby expanding the scope of FPGAs beyond the realm of hardware design

    Optimization and implementation of a Viterbi decoder under flexibility constraints

    This paper discusses the impact of flexibility when designing a Viterbi decoder for both convolutional and TCM codes. Different trade-offs have to be considered in choosing the right architecture for the processing blocks and the resulting hardware penalty is evaluated. We study the impact of symbol quantization that degrades performance and affects the wordlength of the rate-flexible trellis datapath. A radix-2-based architecture for this datapath relaxes the hardware requirements on the branch metric and survivor path blocks substantially. The cost of flexibility in terms of cell area and power consumption is explored by an investigation of synthesized designs that provide different transmission rates. Two designs are fabricated in a digital 0.13-μm CMOS process. Based on post-layout simulations, a symbol baud rate of 168 Mbaud/s is achieved in TCM mode, equivalent to a maximum throughput of 840 Mbit/s using a 64-QAM constellation
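As an illustration of the radix-2 trellis datapath mentioned in the Viterbi decoder abstract, the sketch below shows a generic add-compare-select (ACS) step in which every state has exactly two incoming branches. It is not the paper's architecture: the 4-state trellis, the metric values, and the function name `acs_step` are hypothetical. The first lines also check the quoted rates, which together imply 5 information bits per symbol; that is consistent with 64-QAM (6 bits per symbol) carrying one redundant coding bit, although this reading is an assumption.

```python
# Quoted rates: 840 Mbit/s at 168 Mbaud implies 5 information bits per symbol.
info_bits_per_symbol = 840e6 / 168e6


def acs_step(path_metrics, branch_metrics, predecessors):
    """One radix-2 add-compare-select step (smaller metric wins).

    path_metrics[s]      -- accumulated metric of state s before the step
    branch_metrics[s][k] -- cost of the k-th incoming branch of state s
    predecessors[s]      -- the two states (p0, p1) that feed state s
    Returns the updated metrics and one decision bit per state, which a
    survivor path unit would store for traceback.
    """
    new_metrics, decisions = [], []
    for s, (p0, p1) in enumerate(predecessors):
        cand0 = path_metrics[p0] + branch_metrics[s][0]  # add
        cand1 = path_metrics[p1] + branch_metrics[s][1]
        if cand0 <= cand1:                               # compare
            new_metrics.append(cand0)                    # select
            decisions.append(0)
        else:
            new_metrics.append(cand1)
            decisions.append(1)
    return new_metrics, decisions


# Hypothetical 4-state shift-register trellis: state t is reached from
# states t >> 1 and (t >> 1) | 2.
PREDECESSORS = [(0, 2), (0, 2), (1, 3), (1, 3)]
metrics, decisions = acs_step(
    path_metrics=[0, 3, 2, 1],
    branch_metrics=[[1, 0], [0, 2], [2, 1], [1, 1]],
    predecessors=PREDECESSORS,
)
print(info_bits_per_symbol, metrics, decisions)
```

Because each state compares only two candidates, a radix-2 step needs one comparator per state and one decision bit per state per trellis stage, which is the kind of simplification the abstract attributes to the radix-2 datapath.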

    A Tutorial on the Implementation of Block Ciphers: Software and Hardware Applications

    In this article, we discuss basic strategies that can be used to implement block ciphers in both software and hardware environments. As models for discussion, we use substitution-permutation networks which form the basis for many practical block cipher structures. For software implementation, we discuss approaches such as table lookups and bit-slicing, while for hardware implementation, we examine a broad range of architectures from high speed structures like pipelining, to compact structures based on serialization. To illustrate different implementation concepts, we present example data associated with specific methods and discuss sample designs that can be employed to realize different implementation strategies. We expect that the article will be of particular interest to researchers, scientists, and engineers that are new to the field of cryptographic implementation
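The table-lookup approach mentioned in the block cipher tutorial abstract can be illustrated with a toy 16-bit substitution-permutation network. Everything specific here is an assumption made for illustration and is not taken from the article: the 4-bit S-box values, the bit transposition used as the permutation layer, the round structure, and the round keys. The construction is far too small to be secure. The software trick shown is merging two 4-bit S-boxes into one 256-entry table so that each byte needs a single lookup.

```python
# Toy 16-bit SPN with table-lookup substitution (illustrative, NOT secure).
SBOX = [0xE, 0x4, 0xD, 0x1, 0x2, 0xF, 0xB, 0x8,
        0x3, 0xA, 0x6, 0xC, 0x5, 0x9, 0x0, 0x7]   # assumed 4-bit S-box
INV_SBOX = [SBOX.index(x) for x in range(16)]

# Precomputed byte-wide tables: two nibble substitutions per lookup.
SBOX8 = [(SBOX[b >> 4] << 4) | SBOX[b & 0xF] for b in range(256)]
INV_SBOX8 = [(INV_SBOX[b >> 4] << 4) | INV_SBOX[b & 0xF] for b in range(256)]


def substitute(state, table):
    """Apply a byte-wide table to both bytes of a 16-bit state."""
    return (table[state >> 8] << 8) | table[state & 0xFF]


def permute(state):
    """Move bit 4*i + j to position 4*j + i (a transposition, self-inverse)."""
    out = 0
    for i in range(4):
        for j in range(4):
            out |= ((state >> (4 * i + j)) & 1) << (4 * j + i)
    return out


def encrypt(block, round_keys):
    """Each round: key mixing, substitution, permutation; then a final key."""
    state = block
    for key in round_keys[:-1]:
        state = permute(substitute(state ^ key, SBOX8))
    return state ^ round_keys[-1]


def decrypt(block, round_keys):
    """Undo the rounds in reverse order using the inverse table."""
    state = block ^ round_keys[-1]
    for key in reversed(round_keys[:-1]):
        state = substitute(permute(state), INV_SBOX8) ^ key
    return state


ROUND_KEYS = [0x3A94, 0xD63F, 0x1B5C, 0x7E02]   # hypothetical round keys
ciphertext = encrypt(0x26B7, ROUND_KEYS)
assert decrypt(ciphertext, ROUND_KEYS) == 0x26B7
```

A bit-sliced implementation would instead express the S-box as Boolean operations applied to many blocks in parallel. The lookup version above trades memory for fewer operations per block.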