GME: GPU-based Microarchitectural Extensions to Accelerate Homomorphic Encryption

Abellán, José L.; Agrawal, Rashmi; Bao, Yuhui; Ingare, Alexander; Jonatan, Gilbert; Joshi, Ajay; Kaeli, David; Kim, John; Livesay, Neal; Mora, Evelio; Shen, Michael; Shivdikar, Kaustubh

Publication date: 19 September 2023

Abstract

Fully Homomorphic Encryption (FHE) enables the processing of encrypted data without decrypting it. FHE has garnered significant attention over the past decade as it supports secure outsourcing of data processing to remote cloud services. Despite its promise of strong data privacy and security guarantees, FHE introduces a slowdown of up to five orders of magnitude as compared to the same computation using plaintext data. This overhead is presently a major barrier to the commercial adoption of FHE. In this work, we leverage GPUs to accelerate FHE, capitalizing on a well-established GPU ecosystem available in the cloud. We propose GME, which combines three key microarchitectural extensions along with a compile-time optimization to the current AMD CDNA GPU architecture. First, GME integrates a lightweight on-chip compute unit (CU)-side hierarchical interconnect to retain ciphertext in cache across FHE kernels, thus eliminating redundant memory transactions. Second, to tackle compute bottlenecks, GME introduces special MOD-units that provide native custom hardware support for modular reduction operations, one of the most commonly executed sets of operations in FHE. Third, by integrating the MOD-unit with our novel pipelined 64-bit integer arithmetic cores (WMAC-units), GME further accelerates FHE workloads by 19%. Finally, we propose a Locality-Aware Block Scheduler (LABS) that exploits the temporal locality available in FHE primitive blocks. Incorporating these microarchitectural features and compiler optimizations, we create a synergistic approach achieving average speedups of $796\times$, $14.2\times$, and $2.3\times$ over Intel Xeon CPU, NVIDIA V100 GPU, and Xilinx FPGA implementations, respectively.
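To make concrete the kind of operation the abstract says the MOD-units support natively, the sketch below shows modular multiplication built on Barrett reduction, one common software technique for replacing division with multiplications and shifts. The abstract does not say which reduction algorithm the MOD-unit implements, so the choice of Barrett reduction, the function names, and the example 64-bit prime modulus are all assumptions made for illustration only.

```python
def barrett_precompute(q: int) -> tuple[int, int]:
    """Precompute the shift width k and constant mu = floor(4^k / q)."""
    k = q.bit_length()            # q < 2^k
    mu = (1 << (2 * k)) // q      # one-time division, reused for every reduction
    return k, mu


def barrett_reduce(x: int, q: int, k: int, mu: int) -> int:
    """Return x mod q for 0 <= x < q*q using multiply and shift only."""
    t = (x * mu) >> (2 * k)       # quotient estimate, at most 1 too small
    r = x - t * q                 # hence 0 <= r < 2q
    if r >= q:                    # single conditional correction
        r -= q
    return r


def mod_mul(a: int, b: int, q: int, k: int, mu: int) -> int:
    """Multiply two residues in [0, q) and reduce the double-width product."""
    return barrett_reduce(a * b, q, k, mu)


if __name__ == "__main__":
    # Illustrative 64-bit prime (2^64 - 2^32 + 1); not taken from the paper.
    q = 0xFFFFFFFF00000001
    k, mu = barrett_precompute(q)
    a, b = 123456789, q - 2
    print(mod_mul(a, b, q, k, mu) == (a * b) % q)
```

In software, each such reduction costs several wide multiplications plus a correction step, and FHE kernels execute it for every coefficient of every ciphertext polynomial, which is consistent with the abstract's description of modular reduction as one of the most commonly executed operations.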

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2309.11001