70 research outputs found

    Cross-Layer Rapid Prototyping and Synthesis of Application-Specific and Reconfigurable Many-accelerator Platforms

    Get PDF
    Technological advances of recent years laid the foundation consolidation of informatisationof society, impacting on economic, political, cultural and socialdimensions. At the peak of this realization, today, more and more everydaydevices are connected to the web, giving the term ”Internet of Things”. The futureholds the full connection and interaction of IT and communications systemsto the natural world, delimiting the transition to natural cyber systems and offeringmeta-services in the physical world, such as personalized medical care, autonomoustransportation, smart energy cities etc. . Outlining the necessities of this dynamicallyevolving market, computer engineers are required to implement computingplatforms that incorporate both increased systemic complexity and also cover awide range of meta-characteristics, such as the cost and design time, reliabilityand reuse, which are prescribed by a conflicting set of functional, technical andconstruction constraints. This thesis aims to address these design challenges bydeveloping methodologies and hardware/software co-design tools that enable therapid implementation and efficient synthesis of architectural solutions, which specifyoperating meta-features required by the modern market. Specifically, this thesispresents a) methodologies to accelerate the design flow for both reconfigurableand application-specific architectures, b) coarse-grain heterogeneous architecturaltemplates for processing and communication acceleration and c) efficient multiobjectivesynthesis techniques both at high abstraction level of programming andphysical silicon level.Regarding to the acceleration of the design flow, the proposed methodologyemploys virtual platforms in order to hide architectural details and drastically reducesimulation time. An extension of this framework introduces the systemicco-simulation using reconfigurable acceleration platforms as co-emulation intermediateplatforms. Thus, the development cycle of a hardware/software productis accelerated by moving from a vertical serial flow to a circular interactive loop.Moreover the simulation capabilities are enriched with efficient detection and correctiontechniques of design errors, as well as control methods of performancemetrics of the system according to the desired specifications, during all phasesof the system development. In orthogonal correlation with the aforementionedmethodological framework, a new architectural template is proposed, aiming atbridging the gap between design complexity and technological productivity usingspecialized hardware accelerators in heterogeneous systems-on-chip and networkon-chip platforms. It is presented a novel co-design methodology for the hardwareaccelerators and their respective programming software, including the tasks allocationto the available resources of the system/network. The introduced frameworkprovides implementation techniques for the accelerators, using either conventionalprogramming flows with hardware description language or abstract programmingmodel flows, using techniques from high-level synthesis. In any case, it is providedthe option of systemic measures optimization, such as the processing speed,the throughput, the reliability, the power consumption and the design silicon area.Finally, on addressing the increased complexity in design tools of reconfigurablesystems, there are proposed novel multi-objective optimization evolutionary algo-rithms which exploit the modern multicore processors and the coarse-grain natureof multithreaded programming environments (e.g. OpenMP) in order to reduce theplacement time, while by simultaneously grouping the applications based on theirintrinsic characteristics, the effectively explore the design space effectively.The efficiency of the proposed architectural templates, design tools and methodologyflows is evaluated in relation to the existing edge solutions with applicationsfrom typical computing domains, such as digital signal processing, multimedia andarithmetic complexity, as well as from systemic heterogeneous environments, suchas a computer vision system for autonomous robotic space navigation and manyacceleratorsystems for HPC and workstations/datacenters. The results strengthenthe belief of the author, that this thesis provides competitive expertise to addresscomplex modern - and projected future - design challenges.Οι τεχνολογικές εξελίξεις των τελευταίων ετών έθεσαν τα θεμέλια εδραίωσης της πληροφοριοποίησης της κοινωνίας, επιδρώντας σε οικονομικές,πολιτικές, πολιτιστικές και κοινωνικές διαστάσεις. Στο απόγειο αυτής τη ςπραγμάτωσης, σήμερα, ολοένα και περισσότερες καθημερινές συσκευές συνδέονται στο παγκόσμιο ιστό, αποδίδοντας τον όρο «Ίντερνετ των πραγμάτων».Το μέλλον επιφυλάσσει την πλήρη σύνδεση και αλληλεπίδραση των συστημάτων πληροφορικής και επικοινωνιών με τον φυσικό κόσμο, οριοθετώντας τη μετάβαση στα συστήματα φυσικού κυβερνοχώρου και προσφέροντας μεταυπηρεσίες στον φυσικό κόσμο όπως προσωποποιημένη ιατρική περίθαλψη, αυτόνομες μετακινήσεις, έξυπνες ενεργειακά πόλεις κ.α. . Σκιαγραφώντας τις ανάγκες αυτής της δυναμικά εξελισσόμενης αγοράς, οι μηχανικοί υπολογιστών καλούνται να υλοποιήσουν υπολογιστικές πλατφόρμες που αφενός ενσωματώνουν αυξημένη συστημική πολυπλοκότητα και αφετέρου καλύπτουν ένα ευρύ φάσμα μεταχαρακτηριστικών, όπως λ.χ. το κόστος σχεδιασμού, ο χρόνος σχεδιασμού, η αξιοπιστία και η επαναχρησιμοποίηση, τα οποία προδιαγράφονται από ένα αντικρουόμενο σύνολο λειτουργικών, τεχνολογικών και κατασκευαστικών περιορισμών. Η παρούσα διατριβή στοχεύει στην αντιμετώπιση των παραπάνω σχεδιαστικών προκλήσεων, μέσω της ανάπτυξης μεθοδολογιών και εργαλείων συνσχεδίασης υλικού/λογισμικού που επιτρέπουν την ταχεία υλοποίηση καθώς και την αποδοτική σύνθεση αρχιτεκτονικών λύσεων, οι οποίες προδιαγράφουν τα μετα-χαρακτηριστικά λειτουργίας που απαιτεί η σύγχρονη αγορά. Συγκεκριμένα, στα πλαίσια αυτής της διατριβής, παρουσιάζονται α) μεθοδολογίες επιτάχυνσης της ροής σχεδιασμού τόσο για επαναδιαμορφούμενες όσο και για εξειδικευμένες αρχιτεκτονικές, β) ετερογενή αδρομερή αρχιτεκτονικά πρότυπα επιτάχυνσης επεξεργασίας και επικοινωνίας και γ) αποδοτικές τεχνικές πολυκριτηριακής σύνθεσης τόσο σε υψηλό αφαιρετικό επίπεδο προγραμματισμού,όσο και σε φυσικό επίπεδο πυριτίου.Αναφορικά προς την επιτάχυνση της ροής σχεδιασμού, προτείνεται μια μεθοδολογία που χρησιμοποιεί εικονικές πλατφόρμες, οι οποίες αφαιρώντας τις αρχιτεκτονικές λεπτομέρειες καταφέρνουν να μειώσουν σημαντικά το χρόνο εξομοίωσης. Παράλληλα, εισηγείται η συστημική συν-εξομοίωση με τη χρήση επαναδιαμορφούμενων πλατφορμών, ως μέσων επιτάχυνσης. Με αυτόν τον τρόπο, ο κύκλος ανάπτυξης ενός προϊόντος υλικού, μετατεθειμένος από την κάθετη σειριακή ροή σε έναν κυκλικό αλληλεπιδραστικό βρόγχο, καθίσταται ταχύτερος, ενώ οι δυνατότητες προσομοίωσης εμπλουτίζονται με αποδοτικότερες μεθόδους εντοπισμού και διόρθωσης σχεδιαστικών σφαλμάτων, καθώς και μεθόδους ελέγχου των μετρικών απόδοσης του συστήματος σε σχέση με τις επιθυμητές προδιαγραφές, σε όλες τις φάσεις ανάπτυξης του συστήματος. Σε ορθογώνια συνάφεια με το προαναφερθέν μεθοδολογικό πλαίσιο, προτείνονται νέα αρχιτεκτονικά πρότυπα που στοχεύουν στη γεφύρωση του χάσματος μεταξύ της σχεδιαστικής πολυπλοκότητας και της τεχνολογικής παραγωγικότητας, με τη χρήση συστημάτων εξειδικευμένων επιταχυντών υλικού σε ετερογενή συστήματα-σε-ψηφίδα καθώς και δίκτυα-σε-ψηφίδα. Παρουσιάζεται κατάλληλη μεθοδολογία συν-σχεδίασης των επιταχυντών υλικού και του λογισμικού προκειμένου να αποφασισθεί η κατανομή των εργασιών στους διαθέσιμους πόρους του συστήματος/δικτύου. Το μεθοδολογικό πλαίσιο προβλέπει την υλοποίηση των επιταχυντών είτε με συμβατικές μεθόδους προγραμματισμού σε γλώσσα περιγραφής υλικού είτε με αφαιρετικό προγραμματιστικό μοντέλο με τη χρήση τεχνικών υψηλού επιπέδου σύνθεσης. Σε κάθε περίπτωση, δίδεται η δυνατότητα στο σχεδιαστή για βελτιστοποίηση συστημικών μετρικών, όπως η ταχύτητα επεξεργασίας, η ρυθμαπόδοση, η αξιοπιστία, η κατανάλωση ενέργειας και η επιφάνεια πυριτίου του σχεδιασμού. Τέλος, προκειμένου να αντιμετωπισθεί η αυξημένη πολυπλοκότητα στα σχεδιαστικά εργαλεία επαναδιαμορφούμενων συστημάτων, προτείνονται νέοι εξελικτικοί αλγόριθμοι πολυκριτηριακής βελτιστοποίησης, οι οποίοι εκμεταλλευόμενοι τους σύγχρονους πολυπύρηνους επεξεργαστές και την αδρομερή φύση των πολυνηματικών περιβαλλόντων προγραμματισμού (π.χ. OpenMP), μειώνουν το χρόνο επίλυσης του προβλήματος της τοποθέτησης των λογικών πόρων σε φυσικούς,ενώ ταυτόχρονα, ομαδοποιώντας τις εφαρμογές βάση των εγγενών χαρακτηριστικών τους, διερευνούν αποτελεσματικότερα το χώρο σχεδίασης.Η αποδοτικότητά των προτεινόμενων αρχιτεκτονικών προτύπων και μεθοδολογιών επαληθεύτηκε σε σχέση με τις υφιστάμενες λύσεις αιχμής τόσο σε αυτοτελής εφαρμογές, όπως η ψηφιακή επεξεργασία σήματος, τα πολυμέσα και τα προβλήματα αριθμητικής πολυπλοκότητας, καθώς και σε συστημικά ετερογενή περιβάλλοντα, όπως ένα σύστημα όρασης υπολογιστών για αυτόνομα διαστημικά ρομποτικά οχήματα και ένα σύστημα πολλαπλών επιταχυντών υλικού για σταθμούς εργασίας και κέντρα δεδομένων, στοχεύοντας εφαρμογές υψηλής υπολογιστικής απόδοσης (HPC). Τα αποτελέσματα ενισχύουν την πεποίθηση του γράφοντα, ότι η παρούσα διατριβή παρέχει ανταγωνιστική τεχνογνωσία για την αντιμετώπιση των πολύπλοκων σύγχρονων και προβλεπόμενα μελλοντικών σχεδιαστικών προκλήσεων

    On the Exploration of FPGAs and High-Level Synthesis Capabilities on Multi-Gigabit-per-Second Networks

    Full text link
    Tesis doctoral inédita leída en la Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Tecnología Electrónica y de las Comunicaciones. Fecha de lectura: 24-01-2020Traffic on computer networks has faced an exponential grown in recent years. Both links and communication equipment had to adapt in order to provide a minimum quality of service required for current needs. However, in recent years, a few factors have prevented commercial off-the-shelf hardware from being able to keep pace with this growth rate, consequently, some software tools are struggling to fulfill their tasks, especially at speeds higher than 10 Gbit/s. For this reason, Field Programmable Gate Arrays (FPGAs) have arisen as an alternative to address the most demanding tasks without the need to design an application specific integrated circuit, this is in part to their flexibility and programmability in the field. Needless to say, developing for FPGAs is well-known to be complex. Therefore, in this thesis we tackle the use of FPGAs and High-Level Synthesis (HLS) languages in the context of computer networks. We focus on the use of FPGA both in computer network monitoring application and reliable data transmission at very high-speed. On the other hand, we intend to shed light on the use of high level synthesis languages and boost FPGA applicability in the context of computer networks so as to reduce development time and design complexity. In the first part of the thesis, devoted to computer network monitoring. We take advantage of the FPGA determinism in order to implement active monitoring probes, which consist on sending a train of packets which is later used to obtain network parameters. In this case, the determinism is key to reduce the uncertainty of the measurements. The results of our experiments show that the FPGA implementations are much more accurate and more precise than the software counterpart. At the same time, the FPGA implementation is scalable in terms of network speed — 1, 10 and 100 Gbit/s. In the context of passive monitoring, we leverage the FPGA architecture to implement algorithms able to thin cyphered traffic as well as removing duplicate packets. These two algorithms straightforward in principle, but very useful to help traditional network analysis tools to cope with their task at higher network speeds. On one hand, processing cyphered traffic bring little benefits, on the other hand, processing duplicate traffic impacts negatively in the performance of the software tools. In the second part of the thesis, devoted to the TCP/IP stack. We explore the current limitations of reliable data transmission using standard software at very high-speed. Nowadays, the network is becoming an important bottleneck to fulfill current needs, in particular in data centers. What is more, in recent years the deployment of 100 Gbit/s network links has started. Consequently, there has been an increase scrutiny of how networking functionality is deployed, furthermore, a wide range of approaches are currently being explored to increase the efficiency of networks and tailor its functionality to the actual needs of the application at hand. FPGAs arise as the perfect alternative to deal with this problem. For this reason, in this thesis we develop Limago an FPGA-based open-source implementation of a TCP/IP stack operating at 100 Gbit/s for Xilinx’s FPGAs. Limago not only provides an unprecedented throughput, but also, provides a tiny latency when compared to the software implementations, at least fifteen times. Limago is a key contribution in some of the hottest topic at the moment, for instance, network-attached FPGA and in-network data processing

    Robust and reliable hardware accelerator design through high-level synthesis

    Get PDF
    System-on-chip design is becoming increasingly complex as technology scaling enables more and more functionality on a chip. This scaling-driven complexity has resulted in a variety of reliability and validation challenges including logic bugs, hot spots, wear-out, and soft errors. To make matters worse, as we reach the limits of Dennard scaling, efforts to improve system performance and energy efficiency have resulted in the integration of a wide variety of complex hardware accelerators in SoCs. Thus the challenge is to design complex, custom hardware that is efficient, but also correct and reliable. High-level synthesis shows promise to address the problem of complex hardware design by providing a bridge from the high-productivity software domain to the hardware design process. Much research has been done on high-level synthesis efficiency optimizations. This dissertation shows that high-level synthesis also has the power to address validation and reliability challenges through three automated solutions targeting three key stages in the hardware design and use cycle: pre-silicon debugging, post-silicon validation, and post-deployment error detection. Our solution for rapid pre-silicon debugging of accelerator designs is hybrid tracing: comparing a datapath-level trace of hardware execution with a reference software implementation at a fine temporal and spatial granularity to detect logic bugs. An integrated backtrace process delivers source-code meaning to the hardware designer, pinpointing the location of bug activation and providing a strong hint for potential bug fixes. Experimental results show that we are able to detect and aid in localization of logic bugs from both C/C++ specifications as well as the high-level synthesis engine itself. A variation of this solution tailored for rapid post-silicon validation of accelerator designs is hybrid hashing: inserting signature generation logic in a hardware design to create a heavily compressed signature stream that captures the internal behavior of the design at a fine temporal and spatial granularity for comparison with a reference set of signatures generated by high-level simulation to detect bugs. Using hybrid hashing, we demonstrate an improvement in error detection latency (time elapsed from when a bug is activated to when it manifests as an observable failure) of two orders of magnitude and a threefold improvement in bug coverage compared to traditional post-silicon validation techniques. Hybrid hashing also uncovered previously unknown bugs in the CHStone benchmark suite, which is widely used by the HLS community. Hybrid hashing incurs less than 10% area overhead for the accelerator it validates with negligible performance impact, and we also introduce techniques to minimize any possible intrusiveness introduced by hybrid hashing. Finally, our solution for post-deployment error detection is modulo-3 shadow datapaths: performing lightweight shadow computations in modulo-3 space for each main computation. We leverage the binding and scheduling flexibility of high-level synthesis to detect control errors through diverse binding and minimize area cost through intelligent checkpoint scheduling and modulo-3 reducer sharing. We introduce logic and dataflow optimizations to further reduce cost. We evaluated our technique with 12 high-level synthesis benchmarks from the arithmetic-oriented PolyBench benchmark suite using FPGA emulated netlist-level error injection. We observe coverages of 99.1% for stuck-at faults, 99.5% for soft errors, and 99.6% for timing errors with a 25.7% area cost and negligible performance impact. Leveraging a mean error detection latency of 12.75 cycles (4150× faster than end result check) for soft errors, we also explore a rollback recovery method with an additional area cost of 28.0%, observing a 175× increase in reliability against soft errors. While the area cost of our modulo shadow datapaths is much better than traditional modular redundancy approaches, we want to maximize the applicability of our approach. To this end, we take a dive into gate-level architectural design for modulo arithmetic functional units. We introduce new low-cost gate-level architectures for all four key functional units in a shadow datapath: (1) a modulo reduction algorithm that generates architectures consisting entirely of full-adder standard cells; (2) minimum-area modulo adder and subtractor architectures; (3) an array-based modulo multiplier design; and (4) a modulo equality comparator that handles the residue encoding produced by the above. We compare our new functional units to the previous state-of-the-art approach, observing a 12.5% reduction in area and a 47.1% reduction in delay for a 32-bit mod-3 reducer; that our reducer costs, which tend to dominate shadow datapath costs, do not increase with larger modulo bases; and that for modulo-15 and above, all of our modulo functional units have better area and delay then their previous counterparts. We also demonstrate the practicality of our approach by designing a custom shadow datapath for error detection of a multiply accumulate functional unit, which has an area overhead of only 12% for a 32-bit main datapath and 2-bit modulo-3 shadow datapath. Taking our reliability solution further, we look at the bigger picture of modulo shadow datapaths combined with other solutions at different abstraction layers, looking to answer the following question: Given all of the existing reliability improvement techniques for application-specific hardware accelerators, what techniques or combinations of techniques are the most cost-effective? To answer this question, we consider a soft error fault model and empirically evaluate cross-layer combinations of ABFT, EDDI, and modulo shadow datapaths in the context of high-level synthesis; parity in logic synthesis; and flip-flop hardening techniques at the physical design level. We measure the reliability benefit and area, energy, and performance cost of each technique individually and for interesting technique combinations through FPGA emulated fault-injection and physical place-and-route. Our results show that a combination of parity and flip-flop hardening is the most cost-effective in general with an average 1.3% area cost and 5.7% energy cost for a 50× improvement in reliability. The addition of modulo-3 shadow datapaths to this combination provides some additional benefit for some applications, even without considering its combinational logic, stuck-at fault, and timing error protection benefits. We also observe new efficiency challenges for ABFT and EDDI when used for hardware accelerators

    Automated Debugging Methodology for FPGA-based Systems

    Get PDF
    Electronic devices make up a vital part of our lives. These are seen from mobiles, laptops, computers, home automation, etc. to name a few. The modern designs constitute billions of transistors. However, with this evolution, ensuring that the devices fulfill the designer’s expectation under variable conditions has also become a great challenge. This requires a lot of design time and effort. Whenever an error is encountered, the process is re-started. Hence, it is desired to minimize the number of spins required to achieve an error-free product, as each spin results in loss of time and effort. Software-based simulation systems present the main technique to ensure the verification of the design before fabrication. However, few design errors (bugs) are likely to escape the simulation process. Such bugs subsequently appear during the post-silicon phase. Finding such bugs is time-consuming due to inherent invisibility of the hardware. Instead of software simulation of the design in the pre-silicon phase, post-silicon techniques permit the designers to verify the functionality through the physical implementations of the design. The main benefit of the methodology is that the implemented design in the post-silicon phase runs many order-of-magnitude faster than its counterpart in pre-silicon. This allows the designers to validate their design more exhaustively. This thesis presents five main contributions to enable a fast and automated debugging solution for reconfigurable hardware. During the research work, we used an obstacle avoidance system for robotic vehicles as a use case to illustrate how to apply the proposed debugging solution in practical environments. The first contribution presents a debugging system capable of providing a lossless trace of debugging data which permits a cycle-accurate replay. This methodology ensures capturing permanent as well as intermittent errors in the implemented design. The contribution also describes a solution to enhance hardware observability. It is proposed to utilize processor-configurable concentration networks, employ debug data compression to transmit the data more efficiently, and partially reconfiguring the debugging system at run-time to save the time required for design re-compilation as well as preserve the timing closure. The second contribution presents a solution for communication-centric designs. Furthermore, solutions for designs with multi-clock domains are also discussed. The third contribution presents a priority-based signal selection methodology to identify the signals which can be more helpful during the debugging process. A connectivity generation tool is also presented which can map the identified signals to the debugging system. The fourth contribution presents an automated error detection solution which can help in capturing the permanent as well as intermittent errors without continuous monitoring of debugging data. The proposed solution works for designs even in the absence of golden reference. The fifth contribution proposes to use artificial intelligence for post-silicon debugging. We presented a novel idea of using a recurrent neural network for debugging when a golden reference is present for training the network. Furthermore, the idea was also extended to designs where golden reference is not present

    High-Level Synthesis Based VLSI Architectures for Video Coding

    Get PDF
    High Efficiency Video Coding (HEVC) is state-of-the-art video coding standard. Emerging applications like free-viewpoint video, 360degree video, augmented reality, 3D movies etc. require standardized extensions of HEVC. The standardized extensions of HEVC include HEVC Scalable Video Coding (SHVC), HEVC Multiview Video Coding (MV-HEVC), MV-HEVC+ Depth (3D-HEVC) and HEVC Screen Content Coding. 3D-HEVC is used for applications like view synthesis generation, free-viewpoint video. Coding and transmission of depth maps in 3D-HEVC is used for the virtual view synthesis by the algorithms like Depth Image Based Rendering (DIBR). As first step, we performed the profiling of the 3D-HEVC standard. Computational intensive parts of the standard are identified for the efficient hardware implementation. One of the computational intensive part of the 3D-HEVC, HEVC and H.264/AVC is the Interpolation Filtering used for Fractional Motion Estimation (FME). The hardware implementation of the interpolation filtering is carried out using High-Level Synthesis (HLS) tools. Xilinx Vivado Design Suite is used for the HLS implementation of the interpolation filters of HEVC and H.264/AVC. The complexity of the digital systems is greatly increased. High-Level Synthesis is the methodology which offers great benefits such as late architectural or functional changes without time consuming in rewriting of RTL-code, algorithms can be tested and evaluated early in the design cycle and development of accurate models against which the final hardware can be verified

    Circuit-Variant Moving Target Defense for Side-Channel Attacks on Reconfigurable Hardware

    Get PDF
    With the emergence of side-channel analysis (SCA) attacks, bits of a secret key may be derived by correlating key values with physical properties of cryptographic process execution. Power and Electromagnetic (EM) analysis attacks are based on the principle that current flow within a cryptographic device is key-dependent and therefore, the resulting power consumption and EM emanations during encryption and/or decryption can be correlated to secret key values. These side-channel attacks require several measurements of the target process in order to amplify the signal of interest, filter out noise, and derive the secret key through statistical analysis methods. Differential power and EM analysis attacks rely on correlating actual side-channel measurements to hypothetical models. This research proposes increasing resistance to differential power and EM analysis attacks through structural and spatial randomization of an implementation. By introducing randomly located circuit variants of encryption components, the proposed moving target defense aims to disrupt side-channel collection and correlation needed to successfully implement an attac

    Performance Optimization of Memory Intensive Applications on FPGA Accelerator

    Get PDF
    L'abstract è presente nell'allegato / the abstract is in the attachmen
    corecore