109 research outputs found

Restoring the Broken Covenant Between Compilers and Deep Learning Accelerators

Author: Ahn Byung Hoon
Esmaeilzadeh Hadi
Ghodrati Soroush
Kinzer Sean
Li Xiaolong
Mahapatra Rohan
Mascarenhas Edwin
Matai Janarbek
Zhang Liang
Publication date: 27/10/2023

Deep learning accelerators address the computational demands of Deep Neural Networks (DNNs), departing from the traditional von Neumann execution model. They leverage specialized hardware to align with the application domain's structure. Compilers for these accelerators face distinct challenges compared to those for general-purpose processors. These challenges include exposing and managing more micro-architectural features, handling software-managed scratchpads for on-chip storage, explicitly managing data movement, and matching DNN layers with varying hardware capabilities. These complexities necessitate a new approach to compiler design, as traditional compilers mainly focused on generating fine-grained instruction sequences while abstracting away micro-architectural details. This paper introduces the Architecture Covenant Graph (ACG), an abstract representation of an architectural structure's components and their programmable capabilities. Because the compiler works with the ACG, compilation workflows can adapt when the accelerator design changes, reducing the need for a complete compiler redevelopment. Codelets, which express DNN operation functionality and evolve into execution mappings on the ACG, are key to this process. The Covenant compiler efficiently targets diverse deep learning accelerators, achieving 93.8% of the performance of state-of-the-art, hand-tuned DNN layer implementations when compiling 14 DNN layers from various models on two different architectures.
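
To make the idea of a graph of components and capabilities more concrete, here is a minimal Python sketch. It is not the paper's data structure: the class names, the node names (`DRAM`, `scratchpad`, `systolic_array`, `simd_unit`), and the mapping rule (pick the first capable compute node and a shortest data-movement path to it) are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """One architectural component (hypothetical schema, not the paper's)."""
    name: str
    kind: str                               # "memory" or "compute"
    capabilities: frozenset = frozenset()   # operations the component can run


@dataclass
class CovenantGraph:
    nodes: dict = field(default_factory=dict)
    edges: dict = field(default_factory=dict)   # name -> set of successor names

    def add_node(self, node):
        self.nodes[node.name] = node
        self.edges.setdefault(node.name, set())

    def add_edge(self, src, dst):
        # A directed edge means data can be moved explicitly from src to dst.
        self.edges[src].add(dst)

    def path(self, src, dst):
        """Breadth-first search for a shortest data-movement path, or None."""
        parent = {src: None}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            if cur == dst:
                route = []
                while cur is not None:
                    route.append(cur)
                    cur = parent[cur]
                return route[::-1]
            for nxt in sorted(self.edges[cur]):
                if nxt not in parent:
                    parent[nxt] = cur
                    queue.append(nxt)
        return None

    def map_operation(self, op, source="DRAM"):
        """Return (compute node, data path) for the first node supporting op."""
        for name in sorted(self.nodes):
            node = self.nodes[name]
            if node.kind == "compute" and op in node.capabilities:
                route = self.path(source, name)
                if route is not None:
                    return name, route
        return None


def build_example():
    """Assumed toy accelerator: one scratchpad feeding two compute units."""
    g = CovenantGraph()
    g.add_node(Node("DRAM", "memory"))
    g.add_node(Node("scratchpad", "memory"))
    g.add_node(Node("systolic_array", "compute", frozenset({"conv2d", "gemm"})))
    g.add_node(Node("simd_unit", "compute", frozenset({"relu", "add"})))
    g.add_edge("DRAM", "scratchpad")
    g.add_edge("scratchpad", "systolic_array")
    g.add_edge("scratchpad", "simd_unit")
    g.add_edge("systolic_array", "simd_unit")
    return g
```

In this toy model, changing the accelerator means editing only `build_example`; `map_operation` stays the same, which is the kind of adaptability the abstract attributes to compiling against a graph rather than a fixed target.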

arXiv.org e-Print Archive

Reconfigurable Systems: A Potential Solution to the von Neumann Bottleneck

Author: Miller Damian L
Publication venue: Scholars Crossing
Publication date: 27/04/2011

The difficulty of overcoming the disparity between processor speeds and data access speeds, a condition known as the von Neumann bottleneck, has been a source of consternation for computer hardware developers for many years. Although a number of temporary solutions have been proposed and implemented in modern machines, these solutions have only managed to treat the major symptoms rather than solve the root problem. As the number of transistors on a chip roughly doubles every two years, the von Neumann bottleneck has continued to tighten in spite of these solutions, prompting some computer hardware professionals to advocate a paradigm shift away from the von Neumann architecture into something entirely new. Many have begun advocating the relatively new technology of reconfigurable systems, popularly known as morphware. The difficulty with adopting a new architectural paradigm, however, is that developers on both sides of the software-hardware spectrum must start from scratch, creating entirely new operating systems, hardware peripherals, application software, and user interfaces, all of which must seem familiar to the end user yet still take advantage of the improvements morphware has to offer. With this in mind, this thesis builds on the fundamental theory and current implementations of morphware to describe the processes and products necessary to develop and deliver morphware to the average user as a viable alternative to current technology.

Liberty University Digital Commons

Hardware Architectures for Post-Quantum Cryptography

Author: Wang Wen
Publication venue: EliScholar – A Digital Platform for Scholarly Publishing at Yale
Publication date: 01/04/2021

The rapid development of quantum computers poses severe threats to many commonly used cryptographic algorithms that are embedded in different hardware devices to ensure the security and privacy of data and communication. In the search for new solutions that are potentially resistant to attacks from quantum computers, a new research field called Post-Quantum Cryptography (PQC) has emerged; it covers cryptosystems deployed on classical computers that are conjectured to be secure against attacks utilizing large-scale quantum computers. In order to secure data during storage or communication, and in many other applications in the future, this dissertation focuses on the design, implementation, and evaluation of efficient PQC schemes in hardware. Four PQC algorithms, each from a different family, are studied in this dissertation. The first hardware architecture presented in this dissertation focuses on the code-based scheme Classic McEliece. The research presented in this dissertation is the first to build a hardware architecture for the Classic McEliece cryptosystem. This research successfully demonstrated that complex code-based PQC algorithms can be run efficiently on hardware. Furthermore, this dissertation shows that the hardware implementation of this scheme can be easily tuned to different configurations by implementing support for flexible choices of security parameters as well as configurable hardware performance parameters. The successful prototype of the Classic McEliece scheme on hardware increased confidence in this scheme and helped Classic McEliece gain recognition as one of seven finalists in the third round of the NIST PQC standardization process. While Classic McEliece serves as a ready-to-use candidate for many high-end applications, PQC solutions are also needed for low-end embedded devices. Embedded devices play an important role in our daily life.
Despite their typically constrained resources, these devices require strong security measures to protect them against cyber attacks. Towards securing this type of device, the second research direction presented in this dissertation focuses on the hash-based digital signature scheme XMSS. This research is the first to explore and present a practical hardware-based XMSS solution for low-end embedded devices. In the design of the XMSS hardware, a heterogeneous software-hardware co-design approach was adopted, which combined the flexibility of the soft core with the acceleration from the hard core. The practicability and efficiency of the XMSS software-hardware co-design is further demonstrated by providing a hardware prototype on an open-source RISC-V based System-on-a-Chip (SoC) platform. The third research direction covered in this dissertation focuses on lattice-based cryptography, which represents one of the most promising and popular alternatives to today's widely adopted public key solutions. Prior research has presented hardware designs targeting the computing blocks that are necessary for the implementation of lattice-based systems. However, a recurrent issue in most existing designs is that they are not fully scalable or parameterized and are hence limited to specific cryptographic primitives and security parameter sets. The research presented in this dissertation is the first to develop hardware accelerators that are fully parameterized to support different lattice-based schemes and parameters. Further, these accelerators are utilized to realize the first software-hardware co-design of provably secure instances of qTESLA, which is a lattice-based digital signature scheme. This dissertation demonstrates that even demanding, provably secure schemes can be realized efficiently with proper use of software-hardware co-design.
The final research direction presented in this dissertation focuses on the isogeny-based scheme SIKE, which recently made it to the final round of the PQC standardization process. This research shows that hardware accelerators can be designed to offload compute-intensive elliptic curve and isogeny computations to hardware in a versatile fashion. These hardware accelerators are designed to be fully parameterized to support different security parameter sets of SIKE as well as flexible hardware configurations targeting different user applications. This research is the first to present versatile hardware accelerators for SIKE that can be mapped efficiently to both FPGA and ASIC platforms. Based on these accelerators, an efficient software-hardware co-design is constructed for speeding up SIKE. In the end, this dissertation demonstrates that, despite relying on expensive arithmetic, the isogeny-based SIKE scheme can be run efficiently by exploiting specialized hardware. These four research directions combined demonstrate the practicability of building efficient hardware architectures for complex PQC algorithms. The exploration of efficient PQC solutions for different hardware platforms will eventually help migrate high-end servers and low-end embedded devices towards the post-quantum era.

Yale University

End-to-End Synthesis of Dynamically Controlled Machine Learning Accelerators

Author: Bohm Agostini N.
Brooks D.
Castellana V. G.
Curzel S.
Ferrandi F.
Limaye A.
Manzano J.
Minutoli M.
Tumeo A.
Wei G.
Zhang J. J.
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2022

Archivio istituzionale della ricerca - Politecnico di Milano

Efficient Design Techniques of Switches for Optical Networks and Data Centers

Author: Kyriakos Angelos
Κυριάκος Άγγελος
Publication date: 01/01/2023

The latest design approach for Data Centers exploits optical switching to interconnect Top-of-Rack (ToR) switches that serve thousands of data storage and computing devices. Optical switching has provided the means for developing Data Centers with high-throughput interconnection networks. A significant contribution to advanced optical Data Center designs is the Nephele architecture, which employs optical data planes, optical Point of Delivery (PoD) switches, and ToR switches equipped with 10 Gbps connections to the PoDs and the servers. Nephele follows the Software Defined Networking (SDN) paradigm based on the OpenFlow protocol, and it employs an Agent that communicates the protocol commands to the data plane. A ToR switch is usually expected to support Virtual Output Queues (VOQs), the prevalent solution to the head-of-line blocking problem in Data Center switches. An effective VOQ architecture improves the Data Center's performance by reducing frame communication latency, and it is efficient with respect to implementation cost. This thesis introduces a VOQ architecture for Data Center ToR switches that operate with Time Division Multiple Access (TDMA). The proposed VOQ architecture contains a bounded number of queues at each input port that support the active destinations and forward the input Ethernet frames to a shared memory buffer.
An efficient, low-latency mechanism assigns each queue to an active destination. The VOQs constitute a module of a ToR switch design that is based on a commercially available Ethernet switch and two Xilinx FPGA boards, the Virtex VC707 and the NetFPGA. The VOQ architecture was implemented and validated on the NetFPGA board. Moreover, this thesis presents a management tool for the control plane's Agent of the Data Center. The Graphical User Interface (GUI) of the Agent's management tool is used to configure the Agent, create commands, perform step operations, and monitor the results and the status of the switches. When used as a testing and validation tool, it plays a significant role in improving the Agent's design as well as in upgrading the entire Data Center's organization and performance. Furthermore, aiming to improve the Quality of Service (QoS) for the diverse applications of the Data Center, recent works utilize advanced Deep Learning techniques. The plethora of Machine and Deep Learning applications involve complex processes that impose the need for hardware accelerators to achieve real-time performance. Notable among these are Machine Learning (ML) tasks that use Convolutional Neural Networks (CNNs) for classification applications. Aiming to contribute to CNN accelerator solutions, this thesis focuses on the design of FPGA accelerators for CNNs of limited feature space to improve performance, power consumption, and resource utilization, merits that ultimately enable the use of CNNs locally at the Data Center's ToR switches. The proposed CNN design approach targets designs that can utilize the logic and memory resources of a single FPGA device and benefits numerous applications such as Edge, Mobile, Data Center, and On-Board satellite Computing (OBC). This work exploits the proposed approach to develop an example FPGA accelerator for vessel detection on a Xilinx Virtex 7 XC7VX485T FPGA device.
The resulting architecture achieves an operating frequency of 270 MHz while consuming 5 watts, which validates the approach.
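
The bounded-queue idea in this abstract (a fixed number of queues per input port, each assigned to a currently active destination) can be sketched in a few lines of Python. This is a software model under stated assumptions, not the thesis's FPGA design: the class name `BoundedVOQ`, the policy of refusing a frame when no queue is free, and the rule that a queue is recycled as soon as it drains are all hypothetical.

```python
from collections import deque


class BoundedVOQ:
    """Per-input-port virtual output queues with a fixed queue budget.

    Hypothetical model: the names and the refuse-when-full policy are
    assumptions, not details taken from the thesis.
    """

    def __init__(self, num_queues):
        self.free = list(range(num_queues))   # indices of unassigned queues
        self.queues = [deque() for _ in range(num_queues)]
        self.assigned = {}                    # active destination -> queue index

    def enqueue(self, destination, frame):
        """Store a frame; return False if no queue is available for it."""
        if destination not in self.assigned:
            if not self.free:
                return False                  # every queue serves another destination
            self.assigned[destination] = self.free.pop(0)
        self.queues[self.assigned[destination]].append(frame)
        return True

    def dequeue(self, destination):
        """Pop the oldest frame for a destination, or None if it is inactive."""
        index = self.assigned.get(destination)
        if index is None:
            return None
        frame = self.queues[index].popleft()
        if not self.queues[index]:
            # The destination is no longer active, so recycle its queue.
            del self.assigned[destination]
            self.free.append(index)
        return frame
```

Because frames for different destinations sit in different queues, a busy output never delays frames bound for an idle one, which is how VOQs avoid head-of-line blocking; the fixed budget keeps the number of queues independent of the total number of possible destinations.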

Pergamos : Unified Institutional Repository / Digital Library Platform of the National and Kapodistrian University of Athens

Hardware Implementation of Wireless Communications Algorithms: A Practical Approach

Author: Antonio F. Mondragon-Torres
Publication venue: IntechOpen
Publication date: 14/03/2012

IntechOpen

Implementing FPGA-optimized Systolic Arrays using 2D Knapsack and Evolutionary Algorithms

Author: Chan Long Chan
Publication venue: University of Waterloo
Publication date: 21/01/2022

Underutilization of FPGA resources is a significant challenge in deploying FPGAs as neural network accelerators. We propose an FPGA-optimized systolic array architecture that improves CNN inference throughput by orders of magnitude compared to an un-partitioned systolic array through parallelism-aware partitioning of on-chip resources. We fracture the FPGA into multiple square systolic arrays and formulate the placement of these arrays as a 2D knapsack problem. We simulate the cycle counts needed for each neural network layer given different systolic array sizes using the cycle-accurate systolic array simulator SCALESim. We generate physical implementations and operating frequencies of systolic arrays placed in uniformly staggered locations on Xilinx VU37P and VU9P UltraScale+ platforms. We use the cycle and frequency information in an optimizer that couples the CMA-ES evolutionary algorithm with a simple 2D knapsack solver to discover packable and routable partitioned designs that maximize throughput. In our experiments, the most significant performance improvement comes from layers with large kernel sizes. We demonstrate that an inference throughput gain of 7-22.7x is possible with a 1.2-7.6x sacrifice in latency. Our optimization tool can achieve up to ~8x higher throughput gain on eight MLPerf benchmark network topologies. Our tool also generates designs across various latency and throughput combinations, providing a wide range of design choices.
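
As a rough illustration of packing square arrays onto a chip and scoring the result from cycle counts and clock frequencies, consider the Python sketch below. It is an assumption-laden toy, not the thesis's optimizer: the FPGA is modeled as a plain grid of tiles, placement uses a first-fit, largest-first heuristic rather than a real knapsack solver or CMA-ES, and `throughput` assumes every placed array runs whole inferences independently.

```python
def place_squares(grid_w, grid_h, sizes):
    """Greedily place square arrays on a grid_w x grid_h grid of tiles.

    Returns a list of (size, x, y) for the squares that fit. First-fit,
    largest first: a simple heuristic, not an optimal knapsack solver.
    """
    occupied = [[False] * grid_w for _ in range(grid_h)]
    placed = []
    for size in sorted(sizes, reverse=True):
        spot = _first_free_spot(occupied, grid_w, grid_h, size)
        if spot is None:
            continue                      # this square does not fit anywhere
        x, y = spot
        for row in range(y, y + size):
            for col in range(x, x + size):
                occupied[row][col] = True
        placed.append((size, x, y))
    return placed


def _first_free_spot(occupied, grid_w, grid_h, size):
    """Scan row by row for the first size x size window with no occupied tile."""
    for y in range(grid_h - size + 1):
        for x in range(grid_w - size + 1):
            if all(not occupied[r][c]
                   for r in range(y, y + size)
                   for c in range(x, x + size)):
                return x, y
    return None


def throughput(placed, cycles_per_inference, freq_hz):
    """Inferences per second, assuming each array works independently.

    cycles_per_inference and freq_hz are dicts keyed by array size; in
    practice they would come from simulation and implementation runs.
    """
    return sum(freq_hz[size] / cycles_per_inference[size]
               for size, _, _ in placed)
```

For example, `place_squares(4, 4, [3, 2, 2, 2, 2])` keeps only the 3x3 array because the greedy pass places it first and leaves no room for any 2x2 array, even though four 2x2 arrays would fill the grid. Cases like this are one reason a search over candidate array sizes, as the abstract describes, can beat a single fixed heuristic.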

University of Waterloo's Institutional Repository