105 research outputs found

    SoK: Fully Homomorphic Encryption Accelerators

    Full text link
    Fully Homomorphic Encryption~(FHE) is a key technology enabling privacy-preserving computing. However, the fundamental challenge of FHE is its inefficiency, due primarily to the underlying polynomial computations with high computation complexity and extremely time-consuming ciphertext maintenance operations. To tackle this challenge, various FHE accelerators have recently been proposed by both research and industrial communities. This paper takes the first initiative to conduct a systematic study on the 14 FHE accelerators -- cuHE/cuFHE, nuFHE, HEAT, HEAX, HEXL, HEXL-FPGA, 100×\times, F1, CraterLake, BTS, ARK, Poseidon, FAB and TensorFHE. We first make our observations on the evolution trajectory of these existing FHE accelerators to establish a qualitative connection between them. Then, we perform testbed evaluations of representative open-source FHE accelerators to provide a quantitative comparison on them. Finally, with the insights learned from both qualitative and quantitative studies, we discuss potential directions to inform the future design and implementation for FHE accelerators

    Accelerating Number Theoretic Transformations for Bootstrappable Homomorphic Encryption on GPUs

    Full text link
    Homomorphic encryption (HE) draws huge attention as it provides a way of privacy-preserving computations on encrypted messages. Number Theoretic Transform (NTT), a specialized form of Discrete Fourier Transform (DFT) in the finite field of integers, is the key algorithm that enables fast computation on encrypted ciphertexts in HE. Prior works have accelerated NTT and its inverse transformation on a popular parallel processing platform, GPU, by leveraging DFT optimization techniques. However, these GPU-based studies lack a comprehensive analysis of the primary differences between NTT and DFT or only consider small HE parameters that have tight constraints in the number of arithmetic operations that can be performed without decryption. In this paper, we analyze the algorithmic characteristics of NTT and DFT and assess the performance of NTT when we apply the optimizations that are commonly applicable to both DFT and NTT on modern GPUs. From the analysis, we identify that NTT suffers from severe main-memory bandwidth bottleneck on large HE parameter sets. To tackle the main-memory bandwidth issue, we propose a novel NTT-specific on-the-fly root generation scheme dubbed on-the-fly twiddling (OT). Compared to the baseline radix-2 NTT implementation, after applying all the optimizations, including OT, we achieve 4.2x speedup on a modern GPU.Comment: 12 pages, 13 figures, to appear in IISWC 202

    Energy area and speed optimized signal processing on FPGA

    Get PDF
    Matrix multiplication and Fast Fourier transform are two computational intensive DSP functions widely used as kernel operations in the applications such as graphics, imaging and wireless communication. Traditionally the performance metrics for signal processing has been latency and throughput. Energy efficiency has become increasingly important with proliferation of portable mobile devices as in software defined radio. A FPGA based system is a viable solution for requirement of adaptability and high computational power. But one limitation in FPGA is the limitation of resources. So there is need for optimization between energy, area and latency. There are numerous ways to map an algorithm to FPGA. So for the process of optimization the parameters must be determined by low level simulation of each of the designs possible which gives rise to vast time consumption. So there is need for a high level energy model in which parameters can be determined at algorithm and architectural level rather than low level simulation. In this dissertation matrix multiplication algorithms are implemented with pipelining and parallel processing features to increase throughput and reduce latency there by reduce the energy dissipation. But it increases area by the increased numbers of processing elements. The major area of the design is used by multiplier which further increases with increase in input word width which is difficult for VLSI implementation. So a word width decomposition technique is used with these algorithms to keep the size of multipliers fixed irrespective of the width of input data. FFT algorithms are implemented with pipelining to increase throughput. To reduce energy and area due to the complex multipliers used in the design for multiplication with twiddle factors, distributed arithmetic is used to provide multiplier less architecture. To compensate speed performance parallel distributed arithmetic models are used. This dissertation also proposes method of optimization of the parameters at high level for these two kernel applications by constructing a high level energy model using specified algorithms and architectures. Results obtained from the model are compared with those obtained from low level simulation for estimation of error

    An FPGA-based Embedded System For Fingerprint Matching Using Phase Only Correlation Algorithm

    Get PDF
    none5There is an increasing interest in inexpensive and reliable personal identification in many emerging civilian, commercial and financial applications. Traditional systems such as passwords, PINs, Badges, Smart Cards and Tokens may either be stolen or easy to guess but also to forget, in same cases they may be lost by the user who carries them; all this can lead to identified. Fingerprint-based identification is one of the most used biometric techniques in automated systems for personal identification and it is becoming socially acceptable and cost-effective, since a fingerprint is univocally related to a particular individual. Typical fingerprint identification methods employ feature-based image matching, where minutiae points in the ridge lines (i.e., ridge endings and bifurcations) are identified. Unfortunately this approach is highly influenced by fingertip surface condition. Fingerprint recognition is a complex pattern recognition problem. The efforts to make automatic the matching process based on digital representation of fingerprints, led to the development of Automatic Fingerprint Identification Systems (AFIS). Typically, there are millions of fingerprint records in a database which needs to be entirely searched for a match, to establish the identity of the individual. In order to provide a reasonable response time for each query, it will be better to develop special hardware solutions to implement matching and/or classification algorithms in a really efficient way. In this work we realised a system able to outperform modern PCs in recognising and classifying fingerprints and based on FPGA technology.Il lavoro si è classificato al II posto nell'Altera Contest 2009 Innovate Italy, gara annuale indetta da Altera tra progetti di team di giovani studenti universitari su tutto il territorio nazionale.Giovanni Danese; Mauro Giachero; Francesco Leporati; Giulia Matrone; Nelson NazzicariDanese, Giovanni; Giachero, Mauro; Leporati, Francesco; Matrone, Giulia; Nelson, Nazzicar

    Compact Ring-LWE Cryptoprocessor

    Full text link
    Abstract. In this paper we propose an efficient and compact processor for a ring-LWE based encryption scheme. We present three optimizations for the Num-ber Theoretic Transform (NTT) used for polynomial multiplication: we avoid pre-processing in the negative wrapped convolution by merging it with the main algo-rithm, we reduce the fixed computation cost of the twiddle factors and propose an advanced memory access scheme. These optimization techniques reduce both the cycle and memory requirements. Finally, we also propose an optimization of the ring-LWE encryption system that reduces the number of NTT operations from five to four resulting in a 20 % speed-up. We use these computational optimiza-tions along with several architectural optimizations to design an instruction-set ring-LWE cryptoprocessor. For dimension 256, our processor performs encryp-tion/decryption operations in 20/9 µs on a Virtex 6 FPGA and only requires 1349 LUTs, 860 FFs, 1 DSP-MULT and 2 BRAMs. Similarly for dimension 512, the processor takes 48/21 µs for performing encryption/decryption operations and only requires 1536 LUTs, 953 FFs, 1 DSP-MULT and 3 BRAMs. Our pro-cessors are therefore more than three times smaller than the current state of the art hardware implementations, whilst running somewhat faster

    Circuito electrónico digital para el cálculo de senos y cosenos de múltiplos de un ángulo

    Get PDF
    Circuito electrónico pare el cálculo de los senos y cosenos de múltiplos de un ángulo que permite implementar eficientemente el cómputo de loe factores de twiddle de la transformada de Fourier.Españ

    FPT: a Fixed-Point Accelerator for Torus Fully Homomorphic Encryption

    Full text link
    Fully Homomorphic Encryption is a technique that allows computation on encrypted data. It has the potential to change privacy considerations in the cloud, but computational and memory overheads are preventing its adoption. TFHE is a promising Torus-based FHE scheme that relies on bootstrapping, the noise-removal tool invoked after each encrypted logical/arithmetical operation. We present FPT, a Fixed-Point FPGA accelerator for TFHE bootstrapping. FPT is the first hardware accelerator to exploit the inherent noise present in FHE calculations. Instead of double or single-precision floating-point arithmetic, it implements TFHE bootstrapping entirely with approximate fixed-point arithmetic. Using an in-depth analysis of noise propagation in bootstrapping FFT computations, FPT is able to use noise-trimmed fixed-point representations that are up to 50% smaller than prior implementations. FPT is built as a streaming processor inspired by traditional streaming DSPs: it instantiates directly cascaded high-throughput computational stages, with minimal control logic and routing networks. We explore throughput-balanced compositions of streaming kernels with a user-configurable streaming width in order to construct a full bootstrapping pipeline. Our approach allows 100% utilization of arithmetic units and requires only a small bootstrapping key cache, enabling an entirely compute-bound bootstrapping throughput of 1 BS / 35us. This is in stark contrast to the classical CPU approach to FHE bootstrapping acceleration, which is typically constrained by memory and bandwidth. FPT is implemented and evaluated as a bootstrapping FPGA kernel for an Alveo U280 datacenter accelerator card. FPT achieves two to three orders of magnitude higher bootstrapping throughput than existing CPU-based implementations, and 2.5x higher throughput compared to recent ASIC emulation experiments.Comment: ACM CCS 202
    corecore