105 research outputs found
SoK: Fully Homomorphic Encryption Accelerators
Fully Homomorphic Encryption (FHE) is a key technology enabling
privacy-preserving computing. However, the fundamental challenge of FHE is its
inefficiency, due primarily to the underlying polynomial computations with high
computation complexity and extremely time-consuming ciphertext maintenance
operations. To tackle this challenge, various FHE accelerators have recently
been proposed by both research and industrial communities. This paper takes the
first initiative to conduct a systematic study on the 14 FHE accelerators --
cuHE/cuFHE, nuFHE, HEAT, HEAX, HEXL, HEXL-FPGA, 100x, F1, CraterLake,
BTS, ARK, Poseidon, FAB and TensorFHE. We first make our observations on the
evolution trajectory of these existing FHE accelerators to establish a
qualitative connection between them. Then, we perform testbed evaluations of
representative open-source FHE accelerators to provide a quantitative
comparison of them. Finally, with the insights learned from both the qualitative
and quantitative studies, we discuss potential directions to inform the future
design and implementation of FHE accelerators.
Accelerating Number Theoretic Transformations for Bootstrappable Homomorphic Encryption on GPUs
Homomorphic encryption (HE) draws huge attention as it provides a way of
privacy-preserving computations on encrypted messages. Number Theoretic
Transform (NTT), a specialized form of Discrete Fourier Transform (DFT) in the
finite field of integers, is the key algorithm that enables fast computation on
encrypted ciphertexts in HE. Prior works have accelerated NTT and its inverse
transformation on a popular parallel processing platform, GPU, by leveraging
DFT optimization techniques. However, these GPU-based studies lack a
comprehensive analysis of the primary differences between NTT and DFT or only
consider small HE parameters that have tight constraints in the number of
arithmetic operations that can be performed without decryption. In this paper,
we analyze the algorithmic characteristics of NTT and DFT and assess the
performance of NTT when we apply the optimizations that are commonly applicable
to both DFT and NTT on modern GPUs. From the analysis, we identify that NTT
suffers from a severe main-memory bandwidth bottleneck on large HE parameter
sets. To tackle the main-memory bandwidth issue, we propose a novel
NTT-specific on-the-fly root generation scheme dubbed on-the-fly twiddling
(OT). Compared to the baseline radix-2 NTT implementation, after applying all
the optimizations, including OT, we achieve 4.2x speedup on a modern GPU.
Comment: 12 pages, 13 figures, to appear in IISWC 202
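As a rough illustration of the trade-off described in the abstract above, the following Python sketch runs a radix-2 NTT that derives each twiddle factor by a modular multiplication inside the butterfly loop instead of reading it from a precomputed table. This is not the paper's OT scheme, whose details are not given here; the function names `ntt_on_the_fly` and `naive_ntt` and the toy parameters (`q = 17`, `root = 2`) are assumptions chosen only for illustration.

```python
def ntt_on_the_fly(a, q, root):
    """Radix-2 decimation-in-time NTT of `a` modulo the prime `q`.

    `len(a)` must be a power of two and `root` a primitive len(a)-th
    root of unity mod q. Twiddles are generated by repeated
    multiplication rather than loaded from a table (illustrative only).
    """
    n = len(a)
    a = list(a)

    # Bit-reversal permutation so the butterflies can run in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]

    length = 2
    while length <= n:
        w_len = pow(root, n // length, q)  # primitive length-th root of unity
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * w % q
                a[start + k] = (u + v) % q
                a[start + k + half] = (u - v) % q
                w = w * w_len % q  # next twiddle, computed on the fly
        length <<= 1
    return a


def naive_ntt(a, q, root):
    """O(n^2) reference transform used only to check the fast version."""
    n = len(a)
    return [sum(a[i] * pow(root, i * k, q) for i in range(n)) % q
            for k in range(n)]


# Toy parameters: 2 has multiplicative order 8 modulo 17.
data = [1, 2, 3, 4, 5, 6, 7, 8]
assert ntt_on_the_fly(data, 17, 2) == naive_ntt(data, 17, 2)
```

The sketch replaces one table read per butterfly with one extra modular multiplication, which is the general idea behind trading memory traffic for arithmetic.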
Energy area and speed optimized signal processing on FPGA
Matrix multiplication and the fast Fourier transform (FFT) are two computationally intensive DSP functions widely used as kernel operations in applications such as graphics, imaging and wireless communication. Traditionally, the performance metrics for signal processing have been latency and throughput. Energy efficiency has become increasingly important with the proliferation of portable mobile devices, as in software-defined radio. An FPGA-based system is a viable solution where adaptability and high computational power are required, but FPGA resources are limited, so energy, area and latency must be traded off against one another. There are numerous ways to map an algorithm to an FPGA, and determining the parameters by low-level simulation of every possible design is very time-consuming. A high-level energy model is therefore needed, in which the parameters can be determined at the algorithm and architecture level rather than by low-level simulation. In this dissertation, matrix multiplication algorithms are implemented with pipelining and parallel processing to increase throughput and reduce latency, thereby reducing energy dissipation. This increases area, however, because more processing elements are needed. Most of the design area is taken up by the multipliers, and it grows further as the input word width increases, which makes VLSI implementation difficult. A word-width decomposition technique is therefore used with these algorithms to keep the multiplier size fixed irrespective of the input data width. FFT algorithms are implemented with pipelining to increase throughput. To reduce the energy and area consumed by the complex multipliers that perform the twiddle-factor multiplications, distributed arithmetic is used to provide a multiplier-less architecture. To compensate for the resulting loss of speed, parallel distributed arithmetic models are used.
This dissertation also proposes a method for optimizing the parameters at a high level for these two kernel applications by constructing a high-level energy model using the specified algorithms and architectures. Results obtained from the model are compared with those obtained from low-level simulation to estimate the error.
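The word-width decomposition mentioned in the abstract above can be pictured with a small Python sketch: wide operands are split into fixed-width limbs, so every partial product fits a multiplier of one fixed size, and the partial products are then shifted and summed. The dissertation's actual scheme is not described here; the function name `decomposed_multiply`, the 8-bit limb size and the restriction to unsigned operands are all assumptions made for illustration.

```python
def decomposed_multiply(x, y, width, limb_bits=8):
    """Multiply two unsigned `width`-bit integers using only
    limb_bits x limb_bits partial products (illustrative sketch)."""
    mask = (1 << limb_bits) - 1
    limbs = (width + limb_bits - 1) // limb_bits  # limbs per operand

    # Split each operand into little-endian limbs.
    xs = [(x >> (limb_bits * i)) & mask for i in range(limbs)]
    ys = [(y >> (limb_bits * j)) & mask for j in range(limbs)]

    total = 0
    for i, xi in enumerate(xs):
        for j, yj in enumerate(ys):
            # Each xi * yj is what a fixed-size hardware multiplier would
            # compute; the shift places it at the correct bit weight.
            total += (xi * yj) << (limb_bits * (i + j))
    return total


assert decomposed_multiply(0xBEEF, 0x1234, width=16) == 0xBEEF * 0x1234
```

Doubling the operand width in this sketch quadruples the number of partial products while the multiplier size stays constant, which is the area-versus-latency trade-off the abstract alludes to.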
An FPGA-based Embedded System For Fingerprint Matching Using Phase Only Correlation Algorithm
There is an increasing interest in inexpensive and reliable personal identification in many emerging civilian, commercial and financial applications.
Traditional systems such as passwords, PINs, badges, smart cards and tokens may be stolen, easily guessed or forgotten, and in some cases lost by the user who carries them; all of this can compromise identification.
Fingerprint-based identification is one of the most used biometric techniques in automated systems for personal identification and it is becoming socially acceptable and cost-effective, since a fingerprint is univocally related to a particular individual.
Typical fingerprint identification methods employ feature-based image matching, where minutiae points in the ridge lines (i.e., ridge endings and bifurcations) are identified. Unfortunately this approach is highly influenced by fingertip surface condition.
Fingerprint recognition is a complex pattern recognition problem. Efforts to automate the matching process using digital representations of fingerprints led to the development of Automatic Fingerprint Identification Systems (AFIS). Typically, a database holds millions of fingerprint records, all of which must be searched for a match to establish the identity of the individual.
In order to provide a reasonable response time for each query, it will be better to develop special hardware solutions to implement matching and/or classification algorithms in a really efficient way.
In this work we built an FPGA-based system that outperforms modern PCs at recognising and classifying fingerprints. The work placed second in the Altera Contest 2009 Innovate Italy, an annual competition run by Altera for projects by teams of young university students from across the country.
Giovanni Danese; Mauro Giachero; Francesco Leporati; Giulia Matrone; Nelson Nazzicari
Compact Ring-LWE Cryptoprocessor
In this paper we propose an efficient and compact processor for a ring-LWE based encryption scheme. We present three optimizations for the Number Theoretic Transform (NTT) used for polynomial multiplication: we avoid pre-processing in the negative wrapped convolution by merging it with the main algorithm, we reduce the fixed computation cost of the twiddle factors and propose an advanced memory access scheme. These optimization techniques reduce both the cycle and memory requirements. Finally, we also propose an optimization of the ring-LWE encryption system that reduces the number of NTT operations from five to four, resulting in a 20% speed-up. We use these computational optimizations along with several architectural optimizations to design an instruction-set ring-LWE cryptoprocessor. For dimension 256, our processor performs encryption/decryption operations in 20/9 µs on a Virtex 6 FPGA and only requires 1349 LUTs, 860 FFs, 1 DSP-MULT and 2 BRAMs. Similarly for dimension 512, the processor takes 48/21 µs for performing encryption/decryption operations and only requires 1536 LUTs, 953 FFs, 1 DSP-MULT and 3 BRAMs. Our processors are therefore more than three times smaller than the current state of the art hardware implementations, whilst running somewhat faster.
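For readers unfamiliar with the negative wrapped convolution named in the abstract above, the Python sketch below shows the textbook form with the pre- and post-scaling kept as separate steps, that is, the version the paper reports optimizing away by merging the scaling into the transform. The function names, the O(n^2) reference transform and the toy parameters (`q = 17`, `psi = 2`, `n = 4`) are assumptions for illustration and are not taken from the paper.

```python
def transform(a, q, w):
    """O(n^2) number theoretic transform with root `w` (reference only)."""
    n = len(a)
    return [sum(a[i] * pow(w, i * k, q) for i in range(n)) % q
            for k in range(n)]


def negacyclic_mul(a, b, q, psi):
    """Multiply a and b in Z_q[x]/(x^n + 1).

    `psi` must be a primitive 2n-th root of unity modulo the prime q.
    """
    n = len(a)
    omega = psi * psi % q               # primitive n-th root of unity
    psi_inv = pow(psi, q - 2, q)        # inverses via Fermat's little theorem
    omega_inv = pow(omega, q - 2, q)
    n_inv = pow(n, q - 2, q)

    # Pre-processing: scale coefficient i by psi^i.
    a_s = [a[i] * pow(psi, i, q) % q for i in range(n)]
    b_s = [b[i] * pow(psi, i, q) % q for i in range(n)]

    # Pointwise product in the transform domain.
    c_hat = [x * y % q for x, y in zip(transform(a_s, q, omega),
                                       transform(b_s, q, omega))]

    # Inverse transform, then post-processing by psi^-i and 1/n.
    c_s = transform(c_hat, q, omega_inv)
    return [c_s[i] * n_inv % q * pow(psi_inv, i, q) % q for i in range(n)]


def schoolbook_negacyclic(a, b, q):
    """Direct O(n^2) product with x^n replaced by -1, used as a check."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * b[j]) % q
            else:
                c[i + j - n] = (c[i + j - n] - a[i] * b[j]) % q
    return c


# Toy check: 2 is a primitive 8th root of unity modulo 17, so n = 4.
a, b = [1, 2, 3, 4], [5, 6, 7, 8]
assert negacyclic_mul(a, b, 17, 2) == schoolbook_negacyclic(a, b, 17)
```

The scaling by powers of `psi` is what turns an ordinary cyclic convolution into a reduction modulo x^n + 1 without zero-padding the inputs to length 2n.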
Digital electronic circuit for computing sines and cosines of multiples of an angle
Electronic circuit for computing the sines and
cosines of multiples of an angle, which allows
efficient computation of the
twiddle factors of the Fourier transform.
Spain
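One common way to obtain sines and cosines of successive multiples of an angle, shown here only as a software sketch, is to apply the angle-addition identities repeatedly so that each new pair costs four multiplications and two additions. The entry above does not say which method the circuit uses, so this should not be read as a description of it; the function name `sin_cos_multiples` is hypothetical.

```python
import math


def sin_cos_multiples(theta, count):
    """Return (cos_vals, sin_vals) for k * theta, k = 0 .. count - 1,
    using the angle-addition recurrence (illustrative sketch)."""
    c1, s1 = math.cos(theta), math.sin(theta)
    cos_vals, sin_vals = [1.0], [0.0]
    for _ in range(count - 1):
        c, s = cos_vals[-1], sin_vals[-1]
        # cos((k+1)t) = cos(kt)cos(t) - sin(kt)sin(t)
        # sin((k+1)t) = sin(kt)cos(t) + cos(kt)sin(t)
        cos_vals.append(c * c1 - s * s1)
        sin_vals.append(s * c1 + c * s1)
    return cos_vals, sin_vals


# Twiddle factors of a 16-point Fourier transform use theta = 2*pi/16.
cos_vals, sin_vals = sin_cos_multiples(2 * math.pi / 16, 16)
assert all(abs(cos_vals[k] - math.cos(k * 2 * math.pi / 16)) < 1e-12
           for k in range(16))
```

Rounding error accumulates along the recurrence, so long sequences in fixed-precision hardware would typically need wider intermediate values or periodic correction.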
FPT: a Fixed-Point Accelerator for Torus Fully Homomorphic Encryption
Fully Homomorphic Encryption is a technique that allows computation on
encrypted data. It has the potential to change privacy considerations in the
cloud, but computational and memory overheads are preventing its adoption. TFHE
is a promising Torus-based FHE scheme that relies on bootstrapping, the
noise-removal tool invoked after each encrypted logical/arithmetical operation.
We present FPT, a Fixed-Point FPGA accelerator for TFHE bootstrapping. FPT is
the first hardware accelerator to exploit the inherent noise present in FHE
calculations. Instead of double or single-precision floating-point arithmetic,
it implements TFHE bootstrapping entirely with approximate fixed-point
arithmetic. Using an in-depth analysis of noise propagation in bootstrapping
FFT computations, FPT is able to use noise-trimmed fixed-point representations
that are up to 50% smaller than prior implementations.
FPT is built as a streaming processor inspired by traditional streaming DSPs:
it instantiates directly cascaded high-throughput computational stages, with
minimal control logic and routing networks. We explore throughput-balanced
compositions of streaming kernels with a user-configurable streaming width in
order to construct a full bootstrapping pipeline. Our approach allows 100%
utilization of arithmetic units and requires only a small bootstrapping key
cache, enabling an entirely compute-bound bootstrapping throughput of 1 BS /
35 µs. This is in stark contrast to the classical CPU approach to FHE
bootstrapping acceleration, which is typically constrained by memory and
bandwidth.
FPT is implemented and evaluated as a bootstrapping FPGA kernel for an Alveo
U280 datacenter accelerator card. FPT achieves two to three orders of magnitude
higher bootstrapping throughput than existing CPU-based implementations, and
2.5x higher throughput compared to recent ASIC emulation experiments.
Comment: ACM CCS 202