Fully Homomorphic Encryption (FHE) enables the processing of encrypted data
without decrypting it. FHE has garnered significant attention over the past
decade as it supports secure outsourcing of data processing to remote cloud
services. Despite its promise of strong data privacy and security guarantees,
FHE introduces a slowdown of up to five orders of magnitude as compared to the
same computation using plaintext data. This overhead is presently a major
barrier to the commercial adoption of FHE.
In this work, we leverage GPUs to accelerate FHE, capitalizing on a
well-established GPU ecosystem available in the cloud. We propose GME, which
combines three key microarchitectural extensions along with a compile-time
optimization to the current AMD CDNA GPU architecture. First, GME integrates a
lightweight on-chip compute unit (CU)-side hierarchical interconnect to retain
ciphertext in cache across FHE kernels, thus eliminating redundant memory
transactions. Second, to tackle compute bottlenecks, GME introduces special
MOD-units that provide native custom hardware support for modular reduction
operations, one of the most commonly executed sets of operations in FHE. Third,
by integrating the MOD-unit with our novel pipelined 64-bit integer
arithmetic cores (WMAC-units), GME further accelerates FHE workloads by 19%.
Finally, we propose a Locality-Aware Block Scheduler (LABS) that exploits the
temporal locality available in FHE primitive blocks. Incorporating these
microarchitectural features and compiler optimizations, we create a synergistic
approach achieving average speedups of 796×, 14.2×, and
2.3× over Intel Xeon CPU, NVIDIA V100 GPU, and Xilinx FPGA
implementations, respectively