493 research outputs found

    Towards zero latency photonic switching in shared memory networks

    Get PDF
    Photonic networks-on-chip based on silicon photonics have been proposed to reduce latency and power consumption in future chip multi-core processors (CMP). However, high performance CMPs use a shared memory model which generates large numbers of short messages, creating high arbitration latency overhead for photonic switching networks. In this paper we explore techniques which intelligently use information from the memory hierarchy to predict communication in order to setup photonic circuits with reduced or eliminated arbitration latency. Firstly, we present a switch scheduling algorithm which arbitrates on a per memory transaction basis and holds open photonic circuits to exploit temporal locality. We show that this can reduce the average arbitration latency overhead by 60% and eliminate arbitration latency altogether for a signi cant proportion of memory transactions. We then show how this technique can be applied to multiple-socket shared memory systems with low latency and energy consumption penalties. Finally, we present ideas and initial results to demonstrate that cache miss prediction could be used to set up photonic circuits for more complex memory transactions and main memory accesses

    Hardware optimizations of dense binary hyperdimensional computing: Rematerialization of hypervectors, binarized bundling, and combinational associative memory

    Get PDF
    Brain-inspired hyperdimensional (HD) computing models neural activity patterns of the very size of the brain's circuits with points of a hyperdimensional space, that is, with hypervectors. Hypervectors are Ddimensional (pseudo)random vectors with independent and identically distributed (i.i.d.) components constituting ultra-wide holographic words: D = 10,000 bits, for instance. At its very core, HD computing manipulates a set of seed hypervectors to build composite hypervectors representing objects of interest. It demands memory optimizations with simple operations for an efficient hardware realization. In this article, we propose hardware techniques for optimizations of HD computing, in a synthesizable open-source VHDL library, to enable co-located implementation of both learning and classification tasks on only a small portion of Xilinx UltraScale FPGAs: (1)We propose simple logical operations to rematerialize the hypervectors on the fly rather than loading them from memory. These operations massively reduce the memory footprint by directly computing the composite hypervectors whose individual seed hypervectors do not need to be stored in memory. (2) Bundling a series of hypervectors over time requires a multibit counter per every hypervector component. We instead propose a binarized back-to-back bundling without requiring any counters. This truly enables onchip learning with minimal resources as every hypervector component remains binary over the course of training to avoid otherwise multibit components. (3) For every classification event, an associative memory is in charge of finding the closest match between a set of learned hypervectors and a query hypervector by using a distance metric. This operator is proportional to hypervector dimension (D), and hence may take O(D) cycles per classification event. Accordingly, we significantly improve the throughput of classification by proposing associative memories that steadily reduce the latency of classification to the extreme of a single cycle. (4) We perform a design space exploration incorporating the proposed techniques on FPGAs for a wearable biosignal processing application as a case study. Our techniques achieve up to 2.39 7 area saving, or 2,337 7 throughput improvement. The Pareto optimal HD architecture is mapped on only 18,340 configurable logic blocks (CLBs) to learn and classify five hand gestures using four electromyography sensors

    Multi-Unit Serial Polynomial Multiplier to Accelerate NTRU-Based Cryptographic Schemes in IoT Embedded Systems

    Get PDF
    Concern for the security of embedded systems that implement IoT devices has become a crucial issue, as these devices today support an increasing number of applications and services that store and exchange information whose integrity, privacy, and authenticity must be adequately guaranteed. Modern lattice-based cryptographic schemes have proven to be a good alternative, both to face the security threats that arise as a consequence of the development of quantum computing and to allow efficient implementations of cryptographic primitives in resource-limited embedded systems, such as those used in consumer and industrial applications of the IoT. This article describes the hardware implementation of parameterized multi-unit serial polynomial multipliers to speed up time-consuming operations in NTRU-based cryptographic schemes. The flexibility in selecting the design parameters and the interconnection protocol with a general-purpose processor allow them to be applied both to the standardized variants of NTRU and to the new proposals that are being considered in the post-quantum contest currently held by the National Institute of Standards and Technology, as well as to obtain an adequate cost/performance/security-level trade-off for a target application. The designs are provided as AXI4 bus-compliant intellectual property modules that can be easily incorporated into embedded systems developed with the Vivado design tools. The work provides an extensive set of implementation and characterization results in devices of the Xilinx Zynq-7000 and Zynq UltraScale+ families for the different sets of parameters defined in the NTRUEncrypt standard. It also includes details of their plug and play inclusion as hardware accelerators in the C implementation of this public-key encryption scheme codified in the LibNTRU library, showing that acceleration factors of up to 3.1 are achieved when compared to pure software implementations running on the processing systems included in the programmable devices.European Union 952622Ministerio de Ciencia e Innovación PID2020-116664RB100, 10.13039/50110001103

    Performance realization of Bridge Model using Ethernet-MAC for NoC based system with FPGA Prototyping

    Get PDF
    The System on Chip (SoC) integrates the number of processing elements (PE) with different application requirements on a single chip. The SoC uses bus-based interconnection with shared memory access. However, buses are not scalable and limited to particular interface protocol. To overcome these problems, The Network on Chip (NoC) is an emerging interconnect solution with a scalable and reliable solution over SoC. The bridge model is essential to communicate the NoC based system on SoC. In this article, a cost-effective and efficient bridge model with ethernet-MAC is designed and also the placement of the bride with NoC based system is prototyped on Artix-7 FPGA. The Bridge model mainly contains FIFO modules, Serializer and de-serializer, priority-based arbiter with credit counter, packet framer and packet parser with Ethernet-MAC transceiver Module. The bridge with a single router and different sizes of the NoC based systems with mesh topology are designed using adaptive-XY routing. The performance metrics are evaluated for bridge with NoC in terms of average latency and maximum throughput for different Packet Injection Rate (PIR)

    Proceedings of the 5th International Workshop on Reconfigurable Communication-centric Systems on Chip 2010 - ReCoSoC\u2710 - May 17-19, 2010 Karlsruhe, Germany. (KIT Scientific Reports ; 7551)

    Get PDF
    ReCoSoC is intended to be a periodic annual meeting to expose and discuss gathered expertise as well as state of the art research around SoC related topics through plenary invited papers and posters. The workshop aims to provide a prospective view of tomorrow\u27s challenges in the multibillion transistor era, taking into account the emerging techniques and architectures exploring the synergy between flexible on-chip communication and system reconfigurability

    D2.1 - Report on Selected TRNG and PUF Principles

    Get PDF
    This report represents the final version of Deliverable 2.1 of the HECTOR work package WP2. It is a result of discussions and work on Task 2.1 of all HECTOR partners involved in WP2. The aim of the Deliverable 2.1 is to select principles of random number generators (RNGs) and physical unclonable functions (PUFs) that fulfill strict technology, design and security criteria. For example, the selected RNGs must be suitable for implementation in logic devices according to the German AIS20/31 standard. Correspondingly, the selected PUFs must be suitable for applying similar security approach. A standard PUF evaluation approach does not exist, yet, but it should be proposed in the framework of the project. Selected RNGs and PUFs should be then thoroughly evaluated from the point of view of security and the most suitable principles should be implemented in logic devices, such as Field Programmable Logic Arrays (FPGAs) and Application Specific Integrated Circuits (ASICs) during the next phases of the project

    Improved Circuit Synthesis with Amortized Bootstrapping for FHEW-like Schemes

    Get PDF
    In recent years, the research community has made great progress in improving techniques for privacy-preserving computation such as fully homomorphic encryption (FHE). Despite the progress, there remain open challenges, mostly in the areas of performance and usability, to further advance the adoption of these technologies. This work provides multiple contributions to improve the current state-of-the-art in both areas. More specifically, we significantly simplify the bootstrapping idea by Carpov, Izabach`ene, and Mollimard [1] for Boolean-based FHE schemes such as FHEW or TFHE, making the concept usable in practice. Based on our simplifications, we provide an easy-to-use interface for amortized bootstrapping implementing our improvements in the open-source library FHE-Deck and provide new parameter sets for multi-bit encryptions with state-of-the-art security. We build a toolset that compiles high-level code such as C++ to code that executes operations on encrypted data. For this toolset, we propose the first non-trivial FHE-specific optimizations in synthesizing privacy-preserving circuits from high-level code, namely look-up table (LUT) grouping and adder substitution. Using LUT grouping, we reduce the number of bootstrapping required by almost 35 % on average, while for adder substitution, we reduce the number of required bootstrapping by up to 80 % for certain use cases. Overall, the execution time is up to 3.8× faster using our optimizations compared to previous state-of-the-art circuit synthesis

    PaReNTT: Low-Latency Parallel Residue Number System and NTT-Based Long Polynomial Modular Multiplication for Homomorphic Encryption

    Full text link
    High-speed long polynomial multiplication is important for applications in homomorphic encryption (HE) and lattice-based cryptosystems. This paper addresses low-latency hardware architectures for long polynomial modular multiplication using the number-theoretic transform (NTT) and inverse NTT (iNTT). Chinese remainder theorem (CRT) is used to decompose the modulus into multiple smaller moduli. Our proposed architecture, namely PaReNTT, makes four novel contributions. First, parallel NTT and iNTT architectures are proposed to reduce the number of clock cycles to process the polynomials. This can enable real-time processing for HE applications, as the number of clock cycles to process the polynomial is inversely proportional to the level of parallelism. Second, the proposed architecture eliminates the need for permuting the NTT outputs before their product is input to the iNTT. This reduces latency by n/4 clock cycles, where n is the length of the polynomial, and reduces buffer requirement by one delay-switch-delay circuit of size n. Third, an approach to select special moduli is presented where the moduli can be expressed in terms of a few signed power-of-two terms. Fourth, novel architectures for pre-processing for computing residual polynomials using the CRT and post-processing for combining the residual polynomials are proposed. These architectures significantly reduce the area consumption of the pre-processing and post-processing steps. The proposed long modular polynomial multiplications are ideal for applications that require low latency and high sample rate as these feed-forward architectures can be pipelined at arbitrary levels

    New VLSI design of a MAP/BCJR decoder.

    Get PDF
    Any communication channel suffers from different kinds of noises. By employing forward error correction (FEC) techniques, the reliability of the communication channel can be increased. One of the emerging FEC methods is turbo coding (iterative coding), which employs soft input soft output (SISO) decoding algorithms like maximum a posteriori (MAP) algorithm in its constituent decoders. In this thesis we introduce a design with lower complexity and less than 0.1dB performance loss compare to the best performance observed in Max-Log-MAP algorithm. A parallel and pipeline design of a MAP decoder suitable for ASIC (Application Specific Integrated Circuits) is used to increase the throughput of the chip. The branch metric calculation unit is studied in detail and a new design with lower complexity is proposed. The design is also flexible to communication block sizes, which makes it ideal for variable frame length communication systems. A new even-spaced quantization technique for the proposed MAP decoder is utilized. Normalization techniques are studied and a suitable technique for the Max-Log-MAP decoder is explained. The decoder chip is synthesized and implemented in a 0.18 mum six-layer metal CMOS technology. (Abstract shortened by UMI.)Dept. of Electrical and Computer Engineering. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2004 .S23. Source: Masters Abstracts International, Volume: 43-05, page: 1783. Adviser: Majid Ahmadi. Thesis (M.A.Sc.)--University of Windsor (Canada), 2004
    • …
    corecore