789 research outputs found
High-Performance VLSI Architectures for Lattice-Based Cryptography
Lattice-based cryptography is a cryptographic primitive built upon the hard problems on point lattices. Cryptosystems relying on lattice-based cryptography have attracted huge attention in the last decade since they have post-quantum-resistant security and the remarkable construction of the algorithm. In particular, homomorphic encryption (HE) and post-quantum cryptography (PQC) are the two main applications of lattice-based cryptography. Meanwhile, the efficient hardware implementations for these advanced cryptography schemes are demanding to achieve a high-performance implementation.
This dissertation aims to investigate the novel and high-performance very large-scale integration (VLSI) architectures for lattice-based cryptography, including the HE and PQC schemes. This dissertation first presents different architectures for the number-theoretic transform (NTT)-based polynomial multiplication, one of the crucial parts of the fundamental arithmetic for lattice-based HE and PQC schemes. Then a high-speed modular integer multiplier is proposed, particularly for lattice-based cryptography. In addition, a novel modular polynomial multiplier is presented to exploit the fast finite impulse response (FIR) filter architecture to reduce the computational complexity of the schoolbook modular polynomial multiplication for lattice-based PQC scheme. Afterward, an NTT and Chinese remainder theorem (CRT)-based high-speed modular polynomial multiplier is presented for HE schemes whose moduli are large integers
Accelerating LTV based homomorphic encryption in reconfigurable hardware
After being introduced in 2009, the first fully homomorphic encryption (FHE) scheme has created significant excitement in academia and industry. Despite rapid advances in the last 6 years, FHE schemes are still not ready for deployment due to an efficiency bottleneck. Here we introduce a custom hardware accelerator optimized for a class of reconfigurable logic to bring LTV based somewhat homomorphic encryption (SWHE) schemes one step closer to deployment in real-life applications. The accelerator we present is connected via a fast PCIe interface to a CPU platform to provide homomorphic evaluation services to any application that needs to support blinded computations. Specifically we introduce a number theoretical transform based multiplier architecture capable of efficiently handling very large polynomials. When synthesized for the Xilinx Virtex 7 family the presented architecture can compute the product of large polynomials in under 6.25 msec making it the fastest multiplier design of its kind currently available in the literature and is more than 102 times faster than a software implementation. Using this multiplier we can compute a relinearization operation in 526 msec. When used as an accelerator, for instance, to evaluate the AES block cipher, we estimate a per block homomorphic evaluation performance of 442 msec yielding performance gains of 28.5 and 17 times over similar CPU and GPU implementations, respectively
PaReNTT: Low-Latency Parallel Residue Number System and NTT-Based Long Polynomial Modular Multiplication for Homomorphic Encryption
High-speed long polynomial multiplication is important for applications in
homomorphic encryption (HE) and lattice-based cryptosystems. This paper
addresses low-latency hardware architectures for long polynomial modular
multiplication using the number-theoretic transform (NTT) and inverse NTT
(iNTT). Chinese remainder theorem (CRT) is used to decompose the modulus into
multiple smaller moduli. Our proposed architecture, namely PaReNTT, makes four
novel contributions. First, parallel NTT and iNTT architectures are proposed to
reduce the number of clock cycles to process the polynomials. This can enable
real-time processing for HE applications, as the number of clock cycles to
process the polynomial is inversely proportional to the level of parallelism.
Second, the proposed architecture eliminates the need for permuting the NTT
outputs before their product is input to the iNTT. This reduces latency by n/4
clock cycles, where n is the length of the polynomial, and reduces buffer
requirement by one delay-switch-delay circuit of size n. Third, an approach to
select special moduli is presented where the moduli can be expressed in terms
of a few signed power-of-two terms. Fourth, novel architectures for
pre-processing for computing residual polynomials using the CRT and
post-processing for combining the residual polynomials are proposed. These
architectures significantly reduce the area consumption of the pre-processing
and post-processing steps. The proposed long modular polynomial multiplications
are ideal for applications that require low latency and high sample rate as
these feed-forward architectures can be pipelined at arbitrary levels
Design and implementation of a fast and scalable NTT-based polynomial multiplier architecture
In this paper, we present an optimized FPGA implementation of a novel, fast and highly parallelized NTT-based polynomial multiplier architecture, which proves to be effective as an accelerator for lattice-based homomorphic cryptographic schemes. As I/O operations are as time-consuming as NTT operations during homomorphic computations in a host processor/accelerator setting, instead of achieving the fastest NTT implementation possible on the target FPGA, we focus on a balanced time performance between the NTT and I/O operations. Even with this goal, we achieved the fastest NTT implementation in literature, to the best of our knowledge. For proof of concept, we utilize our architecture in a framework for Fan-Vercauteren (FV) homomorphic encryption scheme, utilizing a hardware/software co-design approach, in which polynomial multiplication operations are offloaded to the accelerator via PCIe bus while the rest of operations in the FV scheme are executed in software running on an off-the-shelf desktop computer. Specifically, our framework is optimized to accelerate Simple Encrypted Arithmetic Library (SEAL), developed by the Cryptography Research Group at Microsoft Research, for the FV encryption scheme, where large degree polynomial multiplications are utilized extensively. The hardware part of the proposed framework targets Xilinx Virtex-7 FPGA device and the proposed framework achieves almost 11x latency speedup for the offloaded operations compared to their pure software implementations
Accelerating Number Theoretic Transformations for Bootstrappable Homomorphic Encryption on GPUs
Homomorphic encryption (HE) draws huge attention as it provides a way of
privacy-preserving computations on encrypted messages. Number Theoretic
Transform (NTT), a specialized form of Discrete Fourier Transform (DFT) in the
finite field of integers, is the key algorithm that enables fast computation on
encrypted ciphertexts in HE. Prior works have accelerated NTT and its inverse
transformation on a popular parallel processing platform, GPU, by leveraging
DFT optimization techniques. However, these GPU-based studies lack a
comprehensive analysis of the primary differences between NTT and DFT or only
consider small HE parameters that have tight constraints in the number of
arithmetic operations that can be performed without decryption. In this paper,
we analyze the algorithmic characteristics of NTT and DFT and assess the
performance of NTT when we apply the optimizations that are commonly applicable
to both DFT and NTT on modern GPUs. From the analysis, we identify that NTT
suffers from severe main-memory bandwidth bottleneck on large HE parameter
sets. To tackle the main-memory bandwidth issue, we propose a novel
NTT-specific on-the-fly root generation scheme dubbed on-the-fly twiddling
(OT). Compared to the baseline radix-2 NTT implementation, after applying all
the optimizations, including OT, we achieve 4.2x speedup on a modern GPU.Comment: 12 pages, 13 figures, to appear in IISWC 202
- …