Accelerating Number Theoretic Transformations for Bootstrappable Homomorphic Encryption on GPUs
Homomorphic encryption (HE) draws huge attention as it provides a way of
privacy-preserving computations on encrypted messages. Number Theoretic
Transform (NTT), a specialized form of Discrete Fourier Transform (DFT) in the
finite field of integers, is the key algorithm that enables fast computation on
encrypted ciphertexts in HE. Prior works have accelerated NTT and its inverse
transformation on a popular parallel processing platform, GPU, by leveraging
DFT optimization techniques. However, these GPU-based studies lack a
comprehensive analysis of the primary differences between NTT and DFT or only
consider small HE parameters that have tight constraints in the number of
arithmetic operations that can be performed without decryption. In this paper,
we analyze the algorithmic characteristics of NTT and DFT and assess the
performance of NTT when we apply the optimizations that are commonly applicable
to both DFT and NTT on modern GPUs. From the analysis, we identify that NTT
suffers from severe main-memory bandwidth bottleneck on large HE parameter
sets. To tackle the main-memory bandwidth issue, we propose a novel
NTT-specific on-the-fly root generation scheme dubbed on-the-fly twiddling
(OT). Compared to the baseline radix-2 NTT implementation, after applying all
the optimizations, including OT, we achieve 4.2x speedup on a modern GPU. Comment: 12 pages, 13 figures, to appear in IISWC 202
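The trade-off behind generating twiddle factors on the fly can be illustrated with a small sketch. This is not the paper's OT scheme: it is a generic radix-2 NTT in Python in which each twiddle factor is either read from a precomputed table or produced by one extra modular multiplication, so fewer values need to be fetched from memory. The toy parameters (q = 17, n = 8, omega = 2) and all function names are assumptions chosen for illustration.

```python
# Toy parameters (assumptions for illustration): q = 17, n = 8, and
# omega = 2, which is a primitive 8th root of unity modulo 17.
Q, N, OMEGA = 17, 8, 2


def naive_ntt(a, omega=OMEGA, q=Q):
    """O(n^2) reference: A[k] = sum_j a[j] * omega^(j*k) mod q."""
    n = len(a)
    return [sum(a[j] * pow(omega, j * k, q) for j in range(n)) % q
            for k in range(n)]


def bit_reverse_copy(a):
    """Return a copy of `a` permuted into bit-reversed index order."""
    n = len(a)
    out = list(a)
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            out[i], out[j] = out[j], out[i]
    return out


def ntt(a, omega=OMEGA, q=Q, table=None):
    """Iterative radix-2 NTT (decimation in time).

    If `table` is given, twiddles are looked up (table[i] = omega^i mod q).
    Otherwise each twiddle is derived from the previous one with a single
    modular multiplication, i.e. generated on the fly.
    """
    n = len(a)
    a = bit_reverse_copy(a)
    length = 2
    while length <= n:
        half = length // 2
        stride = n // length
        w_len = pow(omega, stride, q)  # primitive `length`-th root of unity
        for start in range(0, n, length):
            w = 1
            for k in range(half):
                if table is not None:
                    w = table[k * stride]      # memory lookup
                u = a[start + k]
                v = a[start + k + half] * w % q
                a[start + k] = (u + v) % q
                a[start + k + half] = (u - v) % q
                if table is None:
                    w = w * w_len % q          # computed on the fly
        length *= 2
    return a


a = [1, 2, 3, 4, 5, 6, 7, 8]
table = [pow(OMEGA, i, Q) for i in range(N // 2)]
from_table = ntt(a, table=table)
on_the_fly = ntt(a)
```

Both variants should agree with the naive reference; the on-the-fly path replaces each table read with one multiplication, which is the kind of exchange that can help when memory bandwidth, rather than arithmetic, is the bottleneck.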
CiFHER: A Chiplet-Based FHE Accelerator with a Resizable Structure
Fully homomorphic encryption (FHE) is in the spotlight as a definitive
solution for privacy, but the high computational overhead of FHE poses a
challenge to its practical adoption. Although prior studies have attempted to
design ASIC accelerators to mitigate the overhead, their designs require
excessive amounts of chip resources (e.g., areas) to contain and process
massive data for FHE operations.
We propose CiFHER, a chiplet-based FHE accelerator with a resizable
structure, to tackle the challenge with a cost-effective multi-chip module
(MCM) design. First, we devise a flexible architecture of a chiplet core whose
configuration can be adjusted to conform to the global organization of chiplets
and design constraints. The distinctive feature of our core is a recomposable
functional unit providing varying computational throughput for number-theoretic
transform (NTT), the most dominant function in FHE. Then, we establish
generalized data mapping methodologies to minimize the network overhead when
organizing the chips into the MCM package in a tiled manner, which becomes a
significant bottleneck due to the technology constraints of MCMs. Also, we
analyze the effectiveness of various algorithms, including a novel limb
duplication algorithm, on the MCM architecture. A detailed evaluation shows
that a CiFHER package composed of 4 to 64 compact chiplets provides performance
comparable to state-of-the-art monolithic ASIC FHE accelerators with
significantly lower package-wide power consumption while reducing the area of a
single core to as small as 4.28mm². Comment: 15 pages, 9 figure
GME: GPU-based Microarchitectural Extensions to Accelerate Homomorphic Encryption
Fully Homomorphic Encryption (FHE) enables the processing of encrypted data without decrypting it. FHE has garnered significant attention over the past decade as it supports secure outsourcing of data processing to remote cloud services. Despite its promise of strong data privacy and security guarantees, FHE introduces a slowdown of up to five orders of magnitude as compared to the same computation using plaintext data. This overhead is presently a major barrier to the commercial adoption of FHE. While prior efforts recommend moving to custom accelerators to accelerate FHE computing, these solutions lack cost-effectiveness and scalability. In this work, we leverage GPUs to accelerate FHE, capitalizing on a well-established GPU ecosystem that is available in the cloud. We propose GME, which combines three key microarchitectural extensions along with a compile-time optimization to the current AMD CDNA GPU architecture. First, GME integrates a lightweight on-chip compute unit (CU)-side hierarchical interconnect to retain ciphertext in cache across FHE kernels, thus eliminating redundant memory transactions and improving performance. Second, to tackle compute bottlenecks, GME introduces special MOD-units that provide native custom hardware support for modular reduction
operations, one of the most commonly executed sets of operations in FHE. Third, by integrating the MOD-unit with our novel pipelined 64-bit integer arithmetic cores (WMAC-units), GME further accelerates FHE workloads by 19%. Finally, we propose a Locality-Aware Block Scheduler (LABS) that improves FHE workload performance, exploiting the temporal locality available in FHE primitive blocks. Incorporating these microarchitectural features and compiler optimizations, we create a synergistic approach achieving average speedups of 796×, 14.2×, and 2.3× over Intel Xeon CPU, NVIDIA V100 GPU, and Xilinx FPGA implementations, respectively
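To show why modular reduction is worth dedicated hardware, here is a minimal software sketch of one common reduction technique, Barrett reduction, in Python. The abstract does not say which algorithm the MOD-units implement, so the choice of Barrett reduction, the modulus 2^61 - 1, and every function name below are assumptions made for illustration only.

```python
def barrett_setup(q):
    """Precompute (k, mu) for modulus q, where k is the bit length of q
    and mu = floor(2^(2k) / q)."""
    k = q.bit_length()
    mu = (1 << (2 * k)) // q
    return k, mu


def barrett_reduce(x, q, k, mu):
    """Reduce 0 <= x < 2^(2k) modulo q without a division by q.

    The quotient estimate t is either the true quotient or one less,
    so a single conditional subtraction finishes the reduction.
    """
    t = (x * mu) >> (2 * k)
    r = x - t * q
    if r >= q:
        r -= q
    return r


def mod_mul(a, b, q, k, mu):
    """Modular multiplication of a, b in [0, q) using Barrett reduction."""
    return barrett_reduce(a * b, q, k, mu)


# Hypothetical 61-bit modulus chosen for the example.
q = (1 << 61) - 1
k, mu = barrett_setup(q)
product = mod_mul(q - 1, q - 2, q, k, mu)
```

The division by q is replaced by one multiplication with the precomputed constant, a shift, and a conditional subtraction, which is the kind of fixed operation sequence that maps naturally onto a hardware unit.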
Practical MPC+FHE with Applications in Secure Multi-Party Neural Network Evaluation
The theoretical idea of using FHE to realize MPC has been there for over a decade. Existing threshold (and multi-key) FHE schemes were constructed by modifying and analyzing a traditional single-key FHE in a case-by-case manner, which is technically highly demanding. This work explores a new approach to build threshold FHE (thereby MPC schemes) through tailoring generic MPC protocols to the base FHE scheme while requiring no effort in FHE redesign. We applied our approach to two representative Ring-LWE-based FHE schemes: CKKS and GHS, producing GMPFHE-CKKS and GMPFHE-GHS. We developed MPC protocols based on GMPFHE-CKKS and GMPFHE-GHS which are secure against any number of passive but colluding adversaries. The online cost of our MPC protocol is , as opposed to for existing MPC protocols, and our offline cost is independent of . We experimentally show that the GMPFHE-CKKS-based MPC protocol offers unparalleled amortized performance on multi-party neural network evaluation.
High-precision RNS-CKKS on fixed but smaller word-size architectures: theory and application
A prevalent issue in the residue number system (RNS) variant of the Cheon-Kim-Kim-Song (CKKS) homomorphic encryption (HE) scheme is the challenge of efficiently achieving high precision on hardware architectures with a fixed, yet smaller, word-size of bit-length , especially when the scaling factor satisfies .
In this work, we introduce an efficient solution termed composite scaling. In this approach, we group multiple RNS primes as such that for , and use each composite in the rescaling procedure as . Here, the number of primes, denoted by , is termed the composition degree. This strategy contrasts the traditional rescaling method in RNS-CKKS, where each is chosen as a single -bit prime, a method we designate as single scaling.
To achieve higher precision in single scaling, where , one would either need a novel hardware architecture with word size or would have to resort to relatively inefficient solutions rooted in multi-precision arithmetic. This problem, however, doesn't arise in composite scaling. In the composite scaling approach, the larger the composition degree , the greater the precision attainable with RNS-CKKS across an extensive range of secure parameters tailored for workload deployment.
We have integrated composite scaling RNS-CKKS into both OpenFHE and Lattigo libraries. This integration was achieved via a concrete implementation of the method and its application to the most up-to-date workloads, specifically, logistic regression training and convolutional neural network inference. Our experiments demonstrate that single and composite scaling approaches are functionally equivalent, both theoretically and practically
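The idea of replacing one wide rescaling prime with a product of several word-sized primes can be sketched as a small parameter search. The Python below is only an illustration under assumed toy values (32-bit words, a 50-bit target scaling factor, composition degree 2, and primes congruent to 1 modulo 2048 so that they are NTT-friendly for a hypothetical ring dimension of 1024); it is not the selection procedure used in the paper or in OpenFHE or Lattigo, and all names are hypothetical.

```python
from math import log2


def is_prime(n):
    """Trial-division primality test, adequate for the small toy primes here."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def ntt_friendly_primes_near(target, two_n, count):
    """Return `count` distinct primes p with p % two_n == 1, searching
    outward from `target` and alternating above and below it."""
    base = target - (target % two_n) + 1  # a value congruent to 1 mod two_n
    found, k = [], 0
    while len(found) < count:
        for cand in (base + k * two_n, base - k * two_n):
            if (len(found) < count and cand > 1
                    and cand not in found and is_prime(cand)):
                found.append(cand)
        k += 1
    return found


# Assumed toy parameters (not taken from the paper).
WORD_BITS = 32    # hardware word size
SCALE_BITS = 50   # desired scaling factor of about 2^50, wider than one word
T = 2             # composition degree: primes per composite modulus
TWO_N = 2048      # 2 * ring dimension, for NTT-friendly primes

primes = ntt_friendly_primes_near(1 << (SCALE_BITS // T), TWO_N, T)
composite = 1
for p in primes:
    composite *= p
```

Each prime fits in a 32-bit word, while their product is close to the 50-bit scaling factor that a single word-sized prime could not provide.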
Performance of Hierarchical Transforms in Homomorphic Encryption: A case study on Logistic Regression inference
Recent works challenged the Number-Theoretic Transform (NTT) as the most efficient method for
polynomial multiplication in GPU implementations of Fully Homomorphic Encryption schemes such
as CKKS and BFV. In particular, these works argue that the Discrete Galois Transform (DGT) is a
better candidate for this particular case. However, these claims were never rigorously validated, and
only intuition was used to argue in favor of each transform. This work sheds some light on the
discussion by developing similar CUDA implementations of the CKKS cryptosystem, differing only in
the underlying transform and related data structure. We ran several experiments and collected
performance metrics in different contexts, ranging from the basic direct comparison between the transforms
to measuring the impact of each one on the inference phase of the logistic regression algorithm. Our
observations suggest that, despite some specific polynomial ring configurations, the DGT in a
standalone implementation does not offer the same performance as the NTT. However, when we consider
the entire cryptosystem, we noticed that the effects of the higher arithmetic density of the DGT on
other parts of the implementation are substantial, implying a considerable performance improvement
of up to 15% on the homomorphic multiplication. Furthermore, this speedup is consistent when we
consider a more complex application, indicating that the DGT better suits the target architecture.
REED: Chiplet-Based Scalable Hardware Accelerator for Fully Homomorphic Encryption
Fully Homomorphic Encryption (FHE) has emerged as a promising technology for processing encrypted data without the need for decryption. Despite its potential, its practical implementation has faced challenges due to substantial computational overhead. To address this issue, we propose the chiplet-based FHE accelerator design 'REED', which enables scalability and offers high throughput, thereby enhancing homomorphic encryption deployment in real-world scenarios. It accounts for well-known wafer yield issues during fabrication, which significantly impact production costs. In contrast to state-of-the-art approaches, we also address data exchange overhead by proposing a non-blocking inter-chiplet communication strategy. We incorporate novel pipelined Number Theoretic Transform and automorphism techniques, leveraging parallelism and providing high throughput.
Experimental results demonstrate that the REED 2.5D integrated circuit consumes 177 mm² chip area and 82.5 W average power in 7nm technology, and achieves an impressive speedup of up to 5,982× compared to a CPU (24-core 2×Intel X5690), as well as 2× better energy efficiency and 50% lower development cost than a state-of-the-art ASIC accelerator. To evaluate its practical impact, we are the first to benchmark an encrypted deep neural network training. Overall, this work successfully enhances the practicality and deployability of fully homomorphic encryption in real-world scenarios.
Homomorphic Encryption on GPU
Homomorphic encryption (HE) is a cryptosystem that allows secure processing of encrypted data. One of the most popular HE schemes is the Brakerski-Fan-Vercauteren (BFV), which supports somewhat (SWHE) and fully homomorphic encryption (FHE). Since overly involved arithmetic operations of HE schemes are amenable to concurrent computation, GPU devices can be instrumental in facilitating the practical use of HE in real world applications thanks to their superior parallel processing capacity.
This paper presents an optimized and highly parallelized GPU library to accelerate the BFV scheme. This library includes state-of-the-art implementations of Number Theoretic Transform (NTT) and inverse NTT that minimize the GPU kernel function calls. It makes efficient use of the GPU memory hierarchy and computes 128 NTT operations for ring dimension of only in on RTX 3060Ti GPU. To the best of our knowledge, this is the fastest implementation in the literature. The library also improves the performance of the homomorphic operations of the BFV scheme. Although the library can be independently used, it is also fully integrated with the Microsoft SEAL library, which is a well-known HE library that also implements the BFV scheme. For one ciphertext multiplication, for the ring dimension and the modulus bit size of , our GPU implementation offers times speedup over the SEAL library running on a high-end CPU. The library compares favorably with other state-of-the-art GPU implementations of NTT and the BFV operations. Finally, we implement a privacy-preserving application that classifies encrypted genome data for tumor types and achieve speedups of and over CPU implementations using single and 16 threads, respectively. Our results indicate that GPU implementations can facilitate the deployment of homomorphic cryptographic libraries in real world privacy preserving applications.
BASALISC: Programmable Hardware Accelerator for BGV Fully Homomorphic Encryption
Fully Homomorphic Encryption (FHE) allows for secure computation on encrypted data. Unfortunately, huge memory size, computational cost and bandwidth requirements limit its practicality. We present BASALISC, an architecture family of hardware accelerators that aims to substantially accelerate FHE computations in the cloud. BASALISC is the first to implement the BGV scheme with fully-packed bootstrapping – the noise removal capability necessary for arbitrary-depth computation. It supports a customized version of bootstrapping that can be instantiated with hardware multipliers optimized for area and power.
BASALISC is a three-abstraction-layer RISC architecture, designed for a 1 GHz ASIC implementation and underway toward 150 mm² die tape-out in a 12nm GF process. BASALISC's four-layer memory hierarchy includes a two-dimensional conflict-free inner memory layer that enables 32 Tb/s radix-256 NTT computations without pipeline stalls. Its conflict-resolution permutation hardware is generalized and re-used to compute BGV automorphisms without throughput penalty. BASALISC also has a custom multiply-accumulate unit to accelerate BGV key switching.
The BASALISC toolchain comprises a custom compiler and a joint performance and correctness simulator. To evaluate BASALISC, we study its physical realizability, emulate and formally verify its core functional units, and we study its performance on a set of benchmarks. Simulation results show a speedup of more than 5,000× over HElib – a popular software FHE library