
18 research outputs found

Accelerating Number Theoretic Transformations for Bootstrappable Homomorphic Encryption on GPUs

Author: Ahn Jung Ho
Jung Wonkyung
Kim Sangpyo
Park Jaiyoung
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 03/12/2020

Homomorphic encryption (HE) draws huge attention as it provides a way of privacy-preserving computations on encrypted messages. Number Theoretic Transform (NTT), a specialized form of Discrete Fourier Transform (DFT) in the finite field of integers, is the key algorithm that enables fast computation on encrypted ciphertexts in HE. Prior works have accelerated NTT and its inverse transformation on a popular parallel processing platform, GPU, by leveraging DFT optimization techniques. However, these GPU-based studies lack a comprehensive analysis of the primary differences between NTT and DFT or only consider small HE parameters that have tight constraints on the number of arithmetic operations that can be performed without decryption. In this paper, we analyze the algorithmic characteristics of NTT and DFT and assess the performance of NTT when we apply the optimizations that are commonly applicable to both DFT and NTT on modern GPUs. From the analysis, we identify that NTT suffers from a severe main-memory bandwidth bottleneck on large HE parameter sets. To tackle the main-memory bandwidth issue, we propose a novel NTT-specific on-the-fly root generation scheme dubbed on-the-fly twiddling (OT). Compared to the baseline radix-2 NTT implementation, after applying all the optimizations, including OT, we achieve a 4.2x speedup on a modern GPU. Comment: 12 pages, 13 figures, to appear in IISWC 202
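To make the idea of generating roots during the transform concrete, here is a minimal CPU sketch in Python of a radix-2 NTT in which each twiddle factor is produced by a running multiplication instead of being read from a precomputed table. It is only an illustration under stated assumptions: the function names (`ntt`, `intt`, `naive_ntt`) and the toy parameters (prime 17, root 2, length 8) are hypothetical, and this is not the paper's GPU-specific OT scheme.

```python
def ntt(a, p, g):
    """Iterative radix-2 NTT of `a` modulo prime `p`.

    Assumes len(a) is a power of two and `g` is a primitive
    len(a)-th root of unity modulo p.
    """
    n = len(a)
    a = list(a)
    # Bit-reversal permutation so the butterflies can run in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w_len = pow(g, n // length, p)  # primitive `length`-th root of unity
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * w % p
                a[start + k] = (u + v) % p
                a[start + k + half] = (u - v) % p
                w = w * w_len % p  # twiddle generated on the fly, no table read
        length <<= 1
    return a


def intt(a, p, g):
    """Inverse NTT: forward transform with g^-1, then scale by n^-1."""
    n = len(a)
    n_inv = pow(n, -1, p)  # modular inverse (Python 3.8+)
    return [x * n_inv % p for x in ntt(a, p, pow(g, -1, p))]


def naive_ntt(a, p, g):
    """O(n^2) reference: X[k] = sum_i a[i] * g^(i*k) mod p."""
    n = len(a)
    return [sum(a[i] * pow(g, i * k, p) for i in range(n)) % p for k in range(n)]
```

With the toy parameters, `ntt(data, 17, 2)` can be checked against `naive_ntt(data, 17, 2)`, and `intt` recovers the input. The trade-off sketched here is extra multiplications in exchange for fewer memory reads, which is the direction the abstract describes for bandwidth-bound parameter sets.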

arXiv.org e-Print Archive

Crossref

CiFHER: A Chiplet-Based FHE Accelerator with a Resizable Structure

Author: Ahn Jung Ho
Choi Jaeyoung
Kim Jongmin
Kim Sangpyo
Publication date: 12/08/2023

Fully homomorphic encryption (FHE) is in the spotlight as a definitive solution for privacy, but the high computational overhead of FHE poses a challenge to its practical adoption. Although prior studies have attempted to design ASIC accelerators to mitigate the overhead, their designs require excessive amounts of chip resources (e.g., areas) to contain and process massive data for FHE operations. We propose CiFHER, a chiplet-based FHE accelerator with a resizable structure, to tackle the challenge with a cost-effective multi-chip module (MCM) design. First, we devise a flexible architecture of a chiplet core whose configuration can be adjusted to conform to the global organization of chiplets and design constraints. The distinctive feature of our core is a recomposable functional unit providing varying computational throughput for number-theoretic transform (NTT), the most dominant function in FHE. Then, we establish generalized data mapping methodologies to minimize the network overhead when organizing the chips into the MCM package in a tiled manner, which becomes a significant bottleneck due to the technology constraints of MCMs. Also, we analyze the effectiveness of various algorithms, including a novel limb duplication algorithm, on the MCM architecture. A detailed evaluation shows that a CiFHER package composed of 4 to 64 compact chiplets provides performance comparable to state-of-the-art monolithic ASIC FHE accelerators with significantly lower package-wide power consumption while reducing the area of a single core to as small as 4.28 mm^2. Comment: 15 pages, 9 figures

arXiv.org e-Print Archive

GME: GPU-based Microarchitectural Extensions to Accelerate Homomorphic Encryption

Author: Abellán José L.
Agrawal Rashmi
Bao Yuhui
Ingare Alexander
Jonatan Gilbert
Joshi Ajay
Kaeli David
Kim John
Livesay Neal
Mora Evelio
Shen Michael
Shivdikar Kaustubh
Publication date: 19/09/2023

64-bit integer arithmetic cores (WMAC-units), GME further accelerates FHE workloads by 19%. Finally, we propose a Locality-Aware Block Scheduler (LABS) that exploits the temporal locality available in FHE primitive blocks. Incorporating these microarchitectural features and compiler optimizations, we create a synergistic approach achieving average speedups of 796×, 14.2×, and 2.3× over Intel Xeon CPU, NVIDIA V100 GPU, and Xilinx FPGA implementations, respectively

arXiv.org e-Print Archive

GME: GPU-based Microarchitectural Extensions to Accelerate Homomorphic Encryption

Author: Ajay Joshi
David Kaeli
Gilbert Jonatan
Evelio Mora
José L. Abellán
Alexander Ingare
Kaustubh Shivdikar
Yuhui Bao
Neal Livesay
John Kim
Rashmi Agrawal
Michael Shen
Publication venue: 'American College of Medical Physics (ACMP)'
Publication date: 01/01/2023

Fully Homomorphic Encryption (FHE) enables the processing of encrypted data without decrypting it. FHE has garnered significant attention over the past decade as it supports secure outsourcing of data processing to remote cloud services. Despite its promise of strong data privacy and security guarantees, FHE introduces a slowdown of up to five orders of magnitude as compared to the same computation using plaintext data. This overhead is presently a major barrier to the commercial adoption of FHE. While prior efforts recommend moving to custom accelerators to accelerate FHE computing, these solutions lack cost-effectiveness and scalability. In this work, we leverage GPUs to accelerate FHE, capitalizing on a well-established GPU ecosystem that is available in the cloud. We propose GME, which combines three key microarchitectural extensions along with a compile-time optimization to the current AMD CDNA GPU architecture. First, GME integrates a lightweight on-chip compute unit (CU)-side hierarchical interconnect to retain ciphertext in cache across FHE kernels, thus eliminating redundant memory transactions and improving performance. Second, to tackle compute bottlenecks, GME introduces special MOD-units that provide native custom hardware support for modular reduction operations, one of the most commonly executed sets of operations in FHE. Third, by integrating the MOD-unit with our novel pipelined 64-bit integer arithmetic cores (WMAC-units), GME further accelerates FHE workloads by 19%. Finally, we propose a Locality-Aware Block Scheduler (LABS) that improves FHE workload performance, exploiting the temporal locality available in FHE primitive blocks. Incorporating these microarchitectural features and compiler optimizations, we create a synergistic approach achieving average speedups of 796×, 14.2×, and 2.3× over Intel Xeon CPU, NVIDIA V100 GPU, and Xilinx FPGA implementations, respectively

DIGITUM Universidad de Murcia (España)

Practical MPC+FHE with Applications in Secure Multi-Party Neural Network Evaluation

Author: Changchang Ding
Ruiyu Zhu
Yan Huang
Publication venue: International Association for Cryptologic Research (IACR)
Publication date: 27/06/2020

The theoretical idea of using FHE to realize MPC has been there for over a decade. Existing threshold (and multi-key) FHE schemes were constructed by modifying and analyzing a traditional single-key FHE in a case-by-case manner, and are thus technically highly demanding. This work explores a new approach to build threshold FHE (thereby MPC schemes) through tailoring generic MPC protocols to the base FHE scheme while requiring no effort in FHE redesign. We applied our approach to two representative Ring-LWE-based FHE schemes: CKKS and GHS, producing GMPFHE-CKKS and GMPFHE-GHS. We developed MPC protocols based on GMPFHE-CKKS and GMPFHE-GHS which are secure against any number of passive but colluding adversaries. The online cost of our MPC protocol is $O(|C|)$, as opposed to $O(|C| \cdot n^2)$ for existing MPC protocols, and our offline cost is independent of $|C|$. We experimentally show that the GMPFHE-CKKS-based MPC protocol offers unparalleled amortized performance on multi-party neural network evaluation.

Cryptology ePrint Archive

High-precision RNS-CKKS on fixed but smaller word-size architectures: theory and application

Author: Duhyeong Kim
Fillipe D. M. de Souza
Flavio Bergamaschi
Hubert de Lassus
Huijing Gong
Jai Hyun Park
Jongmin Kim
Jung Hee Cheon
Jung Ho Ahn
Michael Steiner
Minsik Kang
Rashmi Agrawal
Ro Cammarota
Wen Wang
Publication venue: International Association for Cryptologic Research (IACR)
Publication date: 25/09/2023

A prevalent issue in the residue number system (RNS) variant of the Cheon-Kim-Kim-Song (CKKS) homomorphic encryption (HE) scheme is the challenge of efficiently achieving high precision on hardware architectures with a fixed, yet smaller, word-size of bit-length $W$, especially when the scaling factor satisfies $\log\Delta > W$. In this work, we introduce an efficient solution termed composite scaling. In this approach, we group multiple RNS primes as $q_\ell := \prod_{j=0}^{t-1} q_{\ell,j}$ such that $\log q_{\ell,j} < W$ for $0 \le j < t$, and use each composite $q_\ell$ in the rescaling procedure as $\mathsf{ct} \mapsto \lfloor \mathsf{ct} / q_\ell \rceil$. Here, the number of primes, denoted by $t$, is termed the composition degree. This strategy contrasts with the traditional rescaling method in RNS-CKKS, where each $q_\ell$ is chosen as a single $\log\Delta$-bit prime, a method we designate as single scaling. To achieve higher precision in single scaling, where $\log\Delta > W$, one would either need a novel hardware architecture with word size $W' > \log\Delta$ or would have to resort to relatively inefficient solutions rooted in multi-precision arithmetic. This problem, however, doesn't arise in composite scaling. In the composite scaling approach, the larger the composition degree $t$, the greater the precision attainable with RNS-CKKS across an extensive range of secure parameters tailored for workload deployment. We have integrated composite scaling RNS-CKKS into both OpenFHE and Lattigo libraries. This integration was achieved via a concrete implementation of the method and its application to the most up-to-date workloads, specifically, logistic regression training and convolutional neural network inference. Our experiments demonstrate that single and composite scaling approaches are functionally equivalent, both theoretically and practically.
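The arithmetic behind a composite modulus can be sketched with plain Python integers. This is a toy illustration, not the OpenFHE or Lattigo implementation: the word size `W = 8`, the primes 251 and 241, and the helper names are all hypothetical, and real RNS-CKKS rescaling operates on RNS residues of polynomial coefficients rather than on a single big integer.

```python
from math import prod

W = 8                # hypothetical machine word size in bits
PRIMES = [251, 241]  # hypothetical primes q_{l,j}; composition degree t = 2


def composite_modulus(primes, word_bits):
    """Return q_l = prod_j q_{l,j}, checking that every factor fits in a word."""
    for q in primes:
        if q >= 1 << word_bits:
            raise ValueError(f"{q} does not fit in {word_bits} bits")
    return prod(primes)


def rescale(x, q):
    """Round-to-nearest integer division, x -> round(x / q) (ties round up)."""
    return (x + q // 2) // q


q_l = composite_modulus(PRIMES, W)
# Each factor is below 2^W, yet the composite itself is wider than one word,
# which is what lets the effective scaling factor exceed the word size.
print(q_l, q_l.bit_length())
```

In this sketch each prime fits in an 8-bit word while the composite `q_l` needs 16 bits, mirroring the case where $\log q_{\ell,j} < W$ holds for every factor even though $\log\Delta > W$.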

Cryptology ePrint Archive

Performance of Hierarchical Transforms in Homomorphic Encryption: A case study on Logistic Regression inference

Author: Diego F. Aranha
Jheyne N. Ortiz
Pedro Geraldo M. R. Alves
Publication venue: International Association for Cryptologic Research (IACR)
Publication date: 31/01/2022

Recent works challenged the Number-Theoretic Transform (NTT) as the most efficient method for polynomial multiplication in GPU implementations of Fully Homomorphic Encryption schemes such as CKKS and BFV. In particular, these works argue that the Discrete Galois Transform (DGT) is a better candidate for this particular case. However, these claims were never rigorously validated, and only intuition was used to argue in favor of each transform. This work sheds some light on the discussion by developing similar CUDA implementations of the CKKS cryptosystem, differing only in the underlying transform and related data structure. We ran several experiments and collected performance metrics in different contexts, ranging from the basic direct comparison between the transforms to measuring the impact of each one on the inference phase of the logistic regression algorithm. Our observations suggest that, despite some specific polynomial ring configurations, the DGT in a standalone implementation does not offer the same performance as the NTT. However, when we consider the entire cryptosystem, we noticed that the effect of the higher arithmetic density of the DGT on other parts of the implementation is substantial, implying a considerable performance improvement of up to 15% on the homomorphic multiplication. Furthermore, this speedup is consistent when we consider a more complex application, indicating that the DGT better suits the target architecture.

Cryptology ePrint Archive

REED: Chiplet-Based Scalable Hardware Accelerator for Fully Homomorphic Encryption

Author: Ahmet Can Mert
Aikata Aikata
Maxim Deryabin
Sujoy Sinha Roy
Sunmin Kwon
Publication venue: International Association for Cryptologic Research (IACR)
Publication date: 05/08/2023

Fully Homomorphic Encryption (FHE) has emerged as a promising technology for processing encrypted data without the need for decryption. Despite its potential, its practical implementation has faced challenges due to substantial computational overhead. To address this issue, we propose the first chiplet-based FHE accelerator design 'REED', which enables scalability and offers high throughput, thereby enhancing homomorphic encryption deployment in real-world scenarios. It incorporates well-known wafer yield issues during fabrication, which significantly impact production costs. In contrast to state-of-the-art approaches, we also address data exchange overhead by proposing a non-blocking inter-chiplet communication strategy. We incorporate novel pipelined Number Theoretic Transform and automorphism techniques, leveraging parallelism and providing high throughput. Experimental results demonstrate that the REED 2.5D integrated circuit consumes 177 mm^2 chip area and 82.5 W average power in 7nm technology, and achieves a speedup of up to 5,982× compared to a CPU (24-core 2× Intel X5690), as well as 2× better energy efficiency and 50% lower development cost than a state-of-the-art ASIC accelerator. To evaluate its practical impact, we are the first to benchmark an encrypted deep neural network training. Overall, this work successfully enhances the practicality and deployability of fully homomorphic encryption in real-world scenarios.

Cryptology ePrint Archive

Homomorphic Encryption on GPU

Author: Ali Şah Özcan
Can Ayduman
Enes Recep Türkoğlu
Erkay Savaş
Publication venue: International Association for Cryptologic Research (IACR)
Publication date: 17/11/2022

Homomorphic encryption (HE) is a cryptosystem that allows secure processing of encrypted data. One of the most popular HE schemes is the Brakerski-Fan-Vercauteren (BFV), which supports somewhat (SWHE) and fully homomorphic encryption (FHE). Since overly involved arithmetic operations of HE schemes are amenable to concurrent computation, GPU devices can be instrumental in facilitating the practical use of HE in real world applications thanks to their superior parallel processing capacity. This paper presents an optimized and highly parallelized GPU library to accelerate the BFV scheme. This library includes state-of-the-art implementations of Number Theoretic Transform (NTT) and inverse NTT that minimize the GPU kernel function calls. It makes efficient use of the GPU memory hierarchy and computes 128 NTT operations for a ring dimension of $2^{14}$ in only $176.1~\mu\text{s}$ on an RTX 3060Ti GPU. To the best of our knowledge, this is the fastest implementation in the literature. The library also improves the performance of the homomorphic operations of the BFV scheme. Although the library can be independently used, it is also fully integrated with the Microsoft SEAL library, which is a well-known HE library that also implements the BFV scheme. For one ciphertext multiplication, for the ring dimension $2^{14}$ and the modulus bit size of 438, our GPU implementation offers a 63.4 times speedup over the SEAL library running on a high-end CPU. The library compares favorably with other state-of-the-art GPU implementations of NTT and the BFV operations. Finally, we implement a privacy-preserving application that classifies encrypted genome data for tumor types and achieve speedups of 42.98 and 5.7 over CPU implementations using single and 16 threads, respectively. Our results indicate that GPU implementations can facilitate the deployment of homomorphic cryptographic libraries in real world privacy preserving applications.

Cryptology ePrint Archive

BASALISC: Programmable Hardware Accelerator for BGV Fully Homomorphic Encryption

Author: Ben Selfridge
Brian Huffman
Daniel Wagner
David W. Archer
Frederik Vercauteren
Georgios Dimou
Hilder V. L. Pereira
Ingrid Verbauwhede
Michiel Van Beirendonck
Robin Geelen
Tynan McAuley
Publication venue: International Association for Cryptologic Research (IACR)
Publication date: 06/09/2023

Fully Homomorphic Encryption (FHE) allows for secure computation on encrypted data. Unfortunately, huge memory size, computational cost and bandwidth requirements limit its practicality. We present BASALISC, an architecture family of hardware accelerators that aims to substantially accelerate FHE computations in the cloud. BASALISC is the first to implement the BGV scheme with fully-packed bootstrapping – the noise removal capability necessary for arbitrary-depth computation. It supports a customized version of bootstrapping that can be instantiated with hardware multipliers optimized for area and power. BASALISC is a three-abstraction-layer RISC architecture, designed for a 1 GHz ASIC implementation and underway toward 150 mm^2 die tape-out in a 12nm GF process. BASALISC's four-layer memory hierarchy includes a two-dimensional conflict-free inner memory layer that enables 32 Tb/s radix-256 NTT computations without pipeline stalls. Its conflict-resolution permutation hardware is generalized and re-used to compute BGV automorphisms without throughput penalty. BASALISC also has a custom multiply-accumulate unit to accelerate BGV key switching. The BASALISC toolchain comprises a custom compiler and a joint performance and correctness simulator. To evaluate BASALISC, we study its physical realizability, emulate and formally verify its core functional units, and we study its performance on a set of benchmarks. Simulation results show a speedup of more than 5,000× over HElib – a popular software FHE library.

Cryptology ePrint Archive