47 research outputs found

    Proceedings of the 5th International Workshop on Reconfigurable Communication-centric Systems on Chip 2010 - ReCoSoC'10 - May 17-19, 2010, Karlsruhe, Germany. (KIT Scientific Reports; 7551)

    Get PDF
    ReCoSoC is intended to be a periodic annual meeting to expose and discuss gathered expertise as well as state-of-the-art research on SoC-related topics through plenary invited papers and posters. The workshop aims to provide a prospective view of tomorrow's challenges in the multibillion-transistor era, taking into account the emerging techniques and architectures exploring the synergy between flexible on-chip communication and system reconfigurability.

    GME: GPU-based Microarchitectural Extensions to Accelerate Homomorphic Encryption

    Full text link
    Fully Homomorphic Encryption (FHE) enables the processing of encrypted data without decrypting it. FHE has garnered significant attention over the past decade as it supports secure outsourcing of data processing to remote cloud services. Despite its promise of strong data privacy and security guarantees, FHE introduces a slowdown of up to five orders of magnitude as compared to the same computation using plaintext data. This overhead is presently a major barrier to the commercial adoption of FHE. In this work, we leverage GPUs to accelerate FHE, capitalizing on a well-established GPU ecosystem available in the cloud. We propose GME, which combines three key microarchitectural extensions along with a compile-time optimization to the current AMD CDNA GPU architecture. First, GME integrates a lightweight on-chip compute unit (CU)-side hierarchical interconnect to retain ciphertext in cache across FHE kernels, thus eliminating redundant memory transactions. Second, to tackle compute bottlenecks, GME introduces special MOD-units that provide native custom hardware support for modular reduction operations, one of the most commonly executed sets of operations in FHE. Third, by integrating the MOD-unit with our novel pipelined 64-bit integer arithmetic cores (WMAC-units), GME further accelerates FHE workloads by 19%. Finally, we propose a Locality-Aware Block Scheduler (LABS) that exploits the temporal locality available in FHE primitive blocks. Incorporating these microarchitectural features and compiler optimizations, we create a synergistic approach achieving average speedups of 796×, 14.2×, and 2.3× over Intel Xeon CPU, NVIDIA V100 GPU, and Xilinx FPGA implementations, respectively.
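    To make the compute bottleneck concrete, here is a minimal C++ sketch (our illustration, not GME's actual MOD-unit or WMAC datapath) of the modular arithmetic that RNS/NTT-based FHE kernels execute in their inner loops; the reduction step in mul_mod is the operation a native MOD-unit would replace with dedicated hardware. The modulus in main is an arbitrary demonstration value.

```cpp
// Hedged sketch: the modular arithmetic dominating RNS/NTT-based FHE kernels.
// The '% q' reduction in mul_mod is the step a native MOD-unit would implement
// directly instead of a long sequence of 64-bit integer instructions.
#include <cstdint>
#include <cstdio>

// Modular addition with one conditional subtraction (requires q < 2^63).
inline uint64_t add_mod(uint64_t a, uint64_t b, uint64_t q) {
    uint64_t s = a + b;                 // cannot overflow for q < 2^63
    return (s >= q) ? s - q : s;
}

// Modular multiplication via a 128-bit intermediate product.
inline uint64_t mul_mod(uint64_t a, uint64_t b, uint64_t q) {
    unsigned __int128 p = static_cast<unsigned __int128>(a) * b;
    return static_cast<uint64_t>(p % q);
}

// One Cooley-Tukey butterfly, the inner loop of NTT-based polynomial multiplication.
inline void ct_butterfly(uint64_t &x, uint64_t &y, uint64_t w, uint64_t q) {
    uint64_t t = mul_mod(y, w, q);
    y = add_mod(x, q - t, q);           // x - t (mod q)
    x = add_mod(x, t, q);               // x + t (mod q)
}

int main() {
    const uint64_t q = (1ULL << 61) - 1;   // Mersenne prime, used here only as a demo modulus
    uint64_t x = 123456789, y = 987654321;
    ct_butterfly(x, y, /*twiddle*/ 5, q);
    std::printf("x=%llu y=%llu\n", static_cast<unsigned long long>(x),
                static_cast<unsigned long long>(y));
}
```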

    Network-on-Chip

    Get PDF
    Addresses the Challenges Associated with System-on-Chip Integration. Network-on-Chip: The Next Generation of System-on-Chip Integration examines the current issues restricting chip-on-chip communication efficiency, and explores Network-on-Chip (NoC), a promising alternative that equips designers with the capability to produce a scalable, reusable, and high-performance communication backbone by allowing the integration of a large number of cores on a single system-on-chip (SoC). This book provides a basic overview of topics associated with NoC-based design: communication infrastructure design, communication methodology, evaluation framework, and mapping of applications onto NoC. It details the design and evaluation of different proposed NoC structures, low-power techniques, signal integrity and reliability issues, application mapping, testing, and future trends. Utilizing examples of chips that have been implemented in industry and academia, this text presents the full architectural design of components verified through implementation in industrial CAD tools. It describes NoC research and developments, incorporates theoretical proofs strengthening the analysis procedures, and includes algorithms used in NoC design and synthesis. In addition, it considers other upcoming NoC issues, such as low-power NoC design, signal integrity issues, NoC testing, reconfiguration, synthesis, and 3D NoC design. This text comprises 12 chapters and covers:
    - The evolution of NoC from SoC and its research and developmental challenges
    - NoC protocols, elaborating flow control, available network topologies, routing mechanisms, fault tolerance, quality-of-service support, and the design of network interfaces
    - The router design strategies followed in NoCs
    - The evaluation mechanism of NoC architectures
    - The application mapping strategies followed in NoCs
    - Low-power design techniques specifically followed in NoCs
    - The signal integrity and reliability issues of NoC
    - The details of NoC testing strategies reported so far
    - The problem of synthesizing application-specific NoCs
    - Reconfigurable NoC design issues
    - Directions of future research and development in the field of NoC
    Network-on-Chip: The Next Generation of System-on-Chip Integration covers the basic topics, technology, and future trends relevant to NoC-based design, and can be used by engineers, students, researchers, and other industry professionals interested in computer architecture, embedded systems, and parallel/distributed systems.
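    As a small illustration of the routing mechanisms such a text covers, the following sketch (a generic textbook example, not taken from the book) implements deterministic XY routing on a 2D-mesh NoC, which is deadlock-free because packets never turn from the Y dimension back into X.

```cpp
// Generic textbook sketch: deterministic XY routing on a 2D-mesh NoC.
// Packets travel along X until the destination column is reached, then along Y.
#include <cstdio>

enum class Port { Local, East, West, North, South };

// Output port chosen at router (x, y) for a packet destined to (dx, dy).
Port xy_route(int x, int y, int dx, int dy) {
    if (dx > x) return Port::East;
    if (dx < x) return Port::West;
    if (dy > y) return Port::North;
    if (dy < y) return Port::South;
    return Port::Local;                 // packet has arrived
}

int main() {
    // Trace a packet from router (0,0) to router (2,1) on a small mesh.
    int x = 0, y = 0;
    const int dx = 2, dy = 1;
    for (Port p = xy_route(x, y, dx, dy); p != Port::Local; p = xy_route(x, y, dx, dy)) {
        if (p == Port::East)       ++x;
        else if (p == Port::West)  --x;
        else if (p == Port::North) ++y;
        else if (p == Port::South) --y;
        std::printf("hop to (%d,%d)\n", x, y);
    }
}
```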

    Experimental Evaluation and Comparison of Time-Multiplexed Multi-FPGA Routing Architectures

    Get PDF
    Emulating large complex designs requires multi-FPGA systems (MFS). However, inter-FPGA communication is constrained by the limited interconnect capacity that follows from the limited number of FPGA input/output (I/O) pins. Serializing parallel signals onto a single trace effectively addresses this pin-limitation obstacle. Besides the multiplexing scheme and multiplexing ratio (the number of inter-FPGA signals per trace), the choice of the MFS routing architecture also affects the critical path latency. The routing architecture of an MFS is the interconnection pattern of FPGAs, fixed wires, and/or programmable interconnect chips. Performance of existing MFS routing architectures is also limited by the choice of off-chip interface. In this dissertation we propose novel 2D and 3D latency-optimized time-multiplexed MFS routing architectures. We used a rigorous experimental approach and real sequential benchmark circuits to evaluate and compare the proposed and existing MFS routing architectures. This research provides new insight into the encouraging effects of using an off-chip optical interface and three-dimensional MFS routing architectures. Vertical stacking results in shorter off-chip links, improving the overall system frequency with the additional advantage of a smaller footprint area. The proposed 3D architectures employ serialized interconnect between intra-plane and inter-plane FPGAs to address the pin limitation problem. Additionally, all off-chip links are replaced by optical fibers, which improves latency and results in a faster MFS. Results indicated that exploiting the third dimension provides latency and area improvements compared to 2D MFS. We also propose latency-optimized planar 2D MFS architectures in which electrical interconnections are replaced by an optical interface with the same spatial distribution. Performance evaluation and comparison showed that the proposed architectures have reduced critical path delay and improved system frequency compared to conventional MFS. We also experimentally evaluated and compared the system performance of three inter-FPGA communication schemes, i.e. logic multiplexing, SERDES and MGT, in conjunction with two routing architectures, i.e. Completely Connected Graph (CCG) and TORUS. Experimental results showed that SERDES attained a higher maximum frequency than the other two schemes. However, for very high multiplexing ratios, the performance of SERDES and MGT became comparable.
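    The following is a hedged first-order model (our own illustration with assumed constants, not the dissertation's measured numbers) of how the multiplexing ratio enters the inter-FPGA critical-path delay when parallel signals are serialized onto one trace.

```cpp
// First-order model of a time-multiplexed inter-FPGA hop: `ratio` logical signals
// share one physical trace, so the last time slot arrives `ratio` bit periods after
// the first. All constants below are assumptions for illustration only.
#include <cstdio>

double hop_delay_ns(int ratio, double serdes_overhead_ns, double trace_delay_ns,
                    double bit_period_ns) {
    return serdes_overhead_ns + trace_delay_ns + ratio * bit_period_ns;
}

int main() {
    const double serdes = 5.0, trace = 2.0, bit = 0.8;   // assumed values, in nanoseconds
    const int ratios[] = {4, 8, 16, 32};
    for (int r : ratios)
        std::printf("multiplexing ratio %2d -> hop delay %.1f ns\n",
                    r, hop_delay_ns(r, serdes, trace, bit));
}
```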

    Multi-ASIP architectures for a flexible turbo receiver

    Get PDF
    Rapidly evolving wireless standards use modern techniques such as turbo codes, Bit-Interleaved Coded Modulation (BICM), high-order QAM constellations, Signal Space Diversity (SSD), Multi-Input Multi-Output (MIMO) Spatial Multiplexing (SM) and Space-Time Codes (STC), with different parameters, for reliable high-rate data transmission. Adoption of such techniques in the transmitter can impact the receiver architecture in three ways: (1) the complex processing related to advanced techniques such as turbo codes encourages iterative processing in the receiver to improve error-rate performance; (2) to satisfy the high-throughput requirement of an iterative receiver, parallel processing is mandatory; and (3) to support the different techniques and parameters imposed, programmable yet high-throughput hardware processing elements are required. In this thesis, to address the high-throughput requirement of turbo processing, a study of parallelism in turbo decoding is first extended to turbo demodulation and turbo equalization. Based on the results of this parallelism study, a flexible, high-throughput, heterogeneous multi-ASIP NoC-based unified turbo receiver is proposed. The proposed architecture fulfils the target requirements in that: (a) the Application-Specific Instruction-set Processor (ASIP) exploits metric-generation-level parallelism and implements the required flexibility; (b) throughputs beyond the capacity of a single ASIP in a turbo process are achieved through multiple ASIP elements implementing sub-block parallelism and shuffled processing; and (c) a Network-on-Chip is used to handle communication conflicts during parallel processing across multiple ASIPs. To obtain a hardware model of the proposed architecture, two ASIPs are designed. The first, named EquASIP, is dedicated to MMSE-IC equalization and provides a flexible solution for the MIMO techniques adopted in multiple wireless standards, with the capability to work in a turbo equalization context. The second, named DemASIP, is a flexible demapper that can be used in MIMO or single-antenna environments for any modulation up to 256-QAM, with or without iterative demodulation. Using the available TurbASIP and NoC components, the thesis concludes with an FPGA prototype of the heterogeneous multi-ASIP NoC-based unified turbo receiver, integrating 9 instances of 3 different ASIPs with 2 NoCs.
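    As an illustration of the demapping task DemASIP targets, the sketch below (a generic max-log demapper, not the DemASIP instruction set) computes per-bit LLRs for an arbitrary labeled constellation; the QPSK table in main is only a usage example, and the same loop extends to 256-QAM.

```cpp
// Generic max-log soft demapper: for each bit position, find the nearest
// constellation point with that bit at 0 and at 1; the LLR is the scaled
// difference of the two squared distances.
#include <algorithm>
#include <cmath>
#include <complex>
#include <cstdio>
#include <limits>
#include <vector>

struct LabeledSymbol {
    std::complex<double> point;   // constellation point
    unsigned label;               // Gray-coded bit label
};

std::vector<double> maxlog_demap(std::complex<double> y,
                                 const std::vector<LabeledSymbol> &constellation,
                                 int bits_per_symbol, double noise_var) {
    std::vector<double> llr(bits_per_symbol);
    for (int b = 0; b < bits_per_symbol; ++b) {
        double d0 = std::numeric_limits<double>::infinity();
        double d1 = d0;
        for (const auto &s : constellation) {
            double d = std::norm(y - s.point);            // squared Euclidean distance
            if ((s.label >> b) & 1u) d1 = std::min(d1, d);
            else                     d0 = std::min(d0, d);
        }
        llr[b] = (d1 - d0) / noise_var;   // positive value favours bit value 0
    }
    return llr;
}

int main() {
    // Usage example: Gray-labeled QPSK with unit symbol energy.
    const double a = 1.0 / std::sqrt(2.0);
    std::vector<LabeledSymbol> qpsk = {
        {{ a,  a}, 0b00}, {{-a,  a}, 0b01}, {{ a, -a}, 0b10}, {{-a, -a}, 0b11}};
    std::vector<double> llr = maxlog_demap({0.9, 0.7}, qpsk, 2, 0.5);
    std::printf("LLR(b0)=%.2f LLR(b1)=%.2f\n", llr[0], llr[1]);
}
```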

    Approximation Opportunities in Edge Computing Hardware : A Systematic Literature Review

    Get PDF
    With the increasing popularity of the Internet of Things and massive Machine Type Communication technologies, the number of connected devices is rising. However, while these devices bring valuable benefits to our lives, bandwidth and latency constraints challenge Cloud processing of the data volumes they generate. A promising solution to these challenges is the combination of Edge and approximate computing techniques, which allows data processing nearer to the user. This paper surveys the potential benefits of the intersection of these paradigms. We provide a state-of-the-art review of circuit-level and architecture-level hardware techniques and popular applications. We also outline essential future research directions.
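    As one concrete circuit-level example of the techniques such a survey covers, the sketch below simulates a lower-part OR adder in software (our illustration, not a result from the paper): the low k bits are added approximately without a carry chain, trading a bounded error for reduced logic and delay, which suits error-tolerant Edge workloads.

```cpp
// Software simulation of a lower-part OR adder: the low k bits of the sum are
// approximated with a bitwise OR (no carry chain), the upper bits are added exactly.
#include <cstdint>
#include <cstdio>

uint32_t loa_add(uint32_t a, uint32_t b, unsigned k) {
    uint32_t low_mask = (k >= 32) ? ~0u : ((1u << k) - 1u);
    uint32_t low  = (a | b) & low_mask;                    // approximate, carry-free
    uint32_t high = (a & ~low_mask) + (b & ~low_mask);     // exact, ignores carry from low part
    return high | low;
}

int main() {
    const uint32_t a = 51239, b = 46023;
    const unsigned ks[] = {0, 4, 8};                       // k = 0 gives the exact sum
    for (unsigned k : ks) {
        uint32_t approx = loa_add(a, b, k);
        std::printf("k=%u: exact=%u approx=%u error=%d\n", k, a + b, approx,
                    static_cast<int>(approx) - static_cast<int>(a + b));
    }
}
```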

    High performance communication on reconfigurable clusters

    Get PDF
    High Performance Computing (HPC) has matured to where it is an essential third pillar, along with theory and experiment, in most domains of science and engineering. Communication latency is a key factor limiting the performance of HPC, but it can be addressed by integrating communication into accelerators. This integration allows accelerators to communicate with each other without CPU interaction, even bypassing the network stack. Field Programmable Gate Arrays (FPGAs) are the accelerators that currently best integrate communication with computation. The large number of Multi-Gigabit Transceivers (MGTs) on most high-end FPGAs can provide high-bandwidth and low-latency inter-FPGA connections. Additionally, the reconfigurable FPGA fabric enables tight coupling between the computation kernel and the network interface. Our thesis is that an application-aware communication infrastructure for a multi-FPGA system makes substantial progress in solving the HPC communication bottleneck. This dissertation aims to provide an application-aware communication infrastructure for FPGA-centric clusters. Specifically, our solution demonstrates application-awareness across multiple levels in the network stack, including low-level link protocols, router microarchitectures, routing algorithms, and applications. We start by investigating the low-level link protocol and the impact of its latency variance on performance. Our results demonstrate that, although some link jitter is always present, we can still assume near-synchronous communication on an FPGA cluster. This provides the necessary condition for statically scheduled routing. We then propose two novel router microarchitectures for two different kinds of workloads: a wormhole Virtual Channel (VC)-based router for workloads with dynamic communication, and a statically scheduled Virtual Output Queueing (VOQ)-based router for workloads with static communication. For the first (VC-based) router, we propose a framework that generates application-aware router configurations. Our results show that, by adding application-awareness into the router configuration, the network performance of FPGA clusters can be substantially improved. For the second (VOQ-based) router, we propose a novel offline collective routing algorithm, which shows a significant advantage over a state-of-the-art collective routing algorithm. We apply our communication infrastructure to a critical strong-scaling HPC kernel, the 3D FFT. The experimental results demonstrate that the performance of our design is faster than that on CPUs and GPUs by at least one order of magnitude, achieving strong scaling for the target applications. Surprisingly, the FPGA-cluster performance is similar to that of an ASIC cluster. We also implement the 3D FFT on another multi-FPGA platform, the Microsoft Catapult II cloud; its performance is also comparable or superior to CPU and GPU HPC clusters. The second application we investigate is Molecular Dynamics simulation (MD). We model MD on both FPGA clouds and clusters. We find that combining processing and general communication in the same device leads to extremely promising performance and the prospect of MD simulations well into the µs/day range on a commodity cloud.
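    To illustrate the kind of communication pattern a statically scheduled router targets, the sketch below (our simplification, not the dissertation's offline collective routing algorithm) prints a contention-free all-to-all schedule of the sort a distributed 3D FFT needs between its 1D transform phases: in step p, node i sends to node (i + p) mod N, so destinations never collide within a step.

```cpp
// Contention-free all-to-all schedule: in step p, node i sends its block for node
// (i + p) mod N, so no two senders target the same destination in the same step.
// This is the kind of exchange a distributed 3D FFT performs between 1D FFT phases.
#include <cstdio>

int main() {
    const int N = 4;                    // number of FPGAs/nodes (assumed for the demo)
    for (int step = 1; step < N; ++step) {
        std::printf("step %d:", step);
        for (int src = 0; src < N; ++src)
            std::printf("  %d -> %d", src, (src + step) % N);
        std::printf("\n");
    }
}
```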

    Domain specific high performance reconfigurable architecture for a communication platform

    Get PDF

    MOCAST 2021

    Get PDF
    The 10th International Conference on Modern Circuits and Systems Technologies on Electronics and Communications (MOCAST 2021) will take place in Thessaloniki, Greece, from July 5th to July 7th, 2021. The MOCAST technical program includes all aspects of circuit and system technologies, from modeling to design, verification, implementation, and application. This Special Issue presents extended versions of top-ranking papers from the conference. The topics of MOCAST include: analog/RF and mixed-signal circuits; digital circuits and systems design; nonlinear circuits and systems; device and circuit modeling; high-performance embedded systems; systems and applications; sensors and systems; machine learning and AI applications; communication and network systems; power management; imagers, MEMS, medical, and displays; radiation front ends (nuclear and space applications); and education in circuits, systems, and communications.

    Survey of FPGA applications in the period 2000 – 2015 (Technical Report)

    Get PDF
    Romoth J, Porrmann M, Rückert U. Survey of FPGA applications in the period 2000 – 2015 (Technical Report). 2017. Since their introduction, FPGAs have appeared in more and more fields of application. The key advantage is the combination of software-like flexibility with the performance otherwise common to hardware. Nevertheless, every application field introduces special requirements for the computational architecture used. This paper provides an overview of the different topics FPGAs have been used for in the last 15 years of research and why they have been chosen over other processing units such as CPUs.