128 research outputs found
CoFHEE: A Co-processor for Fully Homomorphic Encryption Execution
The migration of computation to the cloud has raised privacy concerns as
sensitive data becomes vulnerable to attacks since they need to be decrypted
for processing. Fully Homomorphic Encryption (FHE) mitigates this issue as it
enables meaningful computations to be performed directly on encrypted data.
Nevertheless, FHE is orders of magnitude slower than unencrypted computation,
which hinders its practicality and adoption. Therefore, improving FHE
performance is essential for its real world deployment. In this paper, we
present a year-long effort to design, implement, fabricate, and post-silicon
validate a hardware accelerator for Fully Homomorphic Encryption dubbed CoFHEE.
With a design area of , CoFHEE aims to improve performance of
ciphertext multiplications, the most demanding arithmetic FHE operation, by
accelerating several primitive operations on polynomials, such as polynomial
additions and subtractions, Hadamard product, and Number Theoretic Transform.
CoFHEE supports polynomial degrees of up to with a maximum
coefficient sizes of 128 bits, while it is capable of performing ciphertext
multiplications entirely on chip for . CoFHEE is fabricated in
55nm CMOS technology and achieves 250 MHz with our custom-built low-power
digital PLL design. In addition, our chip includes two communication interfaces
to the host machine: UART and SPI. This manuscript presents all steps and
design techniques in the ASIC development process, ranging from RTL design to
fabrication and validation. We evaluate our chip with performance and power
experiments and compare it against state-of-the-art software implementations
and other ASIC designs. Developed RTL files are available in an open-source
repository
GPU-based Private Information Retrieval for On-Device Machine Learning Inference
On-device machine learning (ML) inference can enable the use of private user
data on user devices without revealing them to remote servers. However, a pure
on-device solution to private ML inference is impractical for many applications
that rely on embedding tables that are too large to be stored on-device. In
particular, recommendation models typically use multiple embedding tables each
on the order of 1-10 GBs of data, making them impractical to store on-device.
To overcome this barrier, we propose the use of private information retrieval
(PIR) to efficiently and privately retrieve embeddings from servers without
sharing any private information. As off-the-shelf PIR algorithms are usually
too computationally intensive to directly use for latency-sensitive inference
tasks, we 1) propose novel GPU-based acceleration of PIR, and 2) co-design PIR
with the downstream ML application to obtain further speedup. Our GPU
acceleration strategy improves system throughput by more than over
an optimized CPU PIR implementation, and our PIR-ML co-design provides an over
additional throughput improvement at fixed model quality. Together,
for various on-device ML applications such as recommendation and language
modeling, our system on a single V100 GPU can serve up to queries per
second -- a throughput improvement over a CPU-based baseline --
while maintaining model accuracy
Accelerated Encrypted Execution of General-Purpose Applications
Fully Homomorphic Encryption (FHE) is a cryptographic method that guarantees the privacy and security of user data during computation. FHE algorithms can perform unlimited arithmetic computations directly on encrypted data without decrypting it. Thus, even when processed by untrusted systems, confidential data is never exposed. In this work, we develop new techniques for accelerated encrypted execution and demonstrate the significant performance advantages of our approach. Our current focus is the Fully Homomorphic Encryption over the Torus (CGGI) scheme, which is a current state-of-the-art method for evaluating arbitrary functions in the encrypted domain. CGGI represents a computation as a graph of homomorphic logic gates and each individual bit of the plaintext is transformed into a polynomial in the encrypted domain. Arithmetic on such data becomes very expensive: operations on bits become operations on entire polynomials. Therefore, evaluating even relatively simple nonlinear functions, such as a sigmoid, can take thousands of seconds on a single CPU thread. Using our novel framework for end-to-end accelerated encrypted execution called ArctyrEX, developers with no knowledge of complex FHE libraries can simply describe their computation as a C program that is evaluated over 40x faster on an NVIDIA DGX A100 and 6x faster with a single A100 relative to a 256-threaded CPU baseline
Medha: Microcoded Hardware Accelerator for computing on Encrypted Data
Homomorphic encryption enables computation on encrypted data, and hence it has a great potential in privacy-preserving outsourcing of computations to the cloud. Hardware acceleration of homomorphic encryption is crucial as software implementations are very slow. In this paper, we present design methodologies for building a programmable hardware accelerator for speeding up the cloud-side homomorphic evaluations on encrypted data.
First, we propose a divide-and-conquer technique that enables homomorphic evaluations in the polynomial ring RQ,2N = ZQ[x]/(x2N + 1) to use a hardware accelerator that has been built for the smaller ring RQ,N = ZQ[x]/(xN + 1). The technique makes it possible to use a single hardware accelerator flexibly for supporting several homomorphic encryption parameter sets.
Next, we present several architectural design methods that we use to realize the flexible and instruction-set accelerator architecture, which we call ‘Medha’. At every level of the implementation hierarchy, we explore possibilities for parallel processing. Starting from hardware-friendly parallel algorithms for the basic building blocks, we gradually build heavily parallel RNS polynomial arithmetic units. Next, many of these parallel units are interconnected elegantly so that their interconnections require the minimum number of nets, therefore making the overall architecture placement-friendly on the platform. As homomorphic encryption is computation- as well as data-centric, the speed of homomorphic evaluations depends greatly on the way the data variables are handled. For Medha, we take a memory-conservative design approach and get rid of any off-chip memory access during homomorphic evaluations.
Finally, we implement Medha in a Xilinx Alveo U250 FPGA and measure timing performances of the microcoded homomorphic addition, multiplication, key-switching, and rescaling routines for the leveled fully homomorphic encryption scheme RNSHEAAN at 200 MHz clock frequency. For the large parameter sets (log Q,N) = (438, 214) and (546, 215), Medha achieves accelerations by up to 68× and 78× times respectively compared to a highly optimized software implementation Microsoft SEAL running at 2.3 GHz
CUDA-Accelerated RNS Multiplication in Word-Wise Homomorphic Encryption Schemes
Homomorphic encryption (HE), which allows computation over encrypted data, has often been used to preserve privacy. However, the computationally heavy nature and complexity of network topologies make the deployment of HE schemes in the Internet of Things (IoT) scenario difficult. In this work, we propose CARM, the first optimized GPU implementation that covers BGV, BFV and CKKS, targeting for accelerating homomorphic multiplication using GPU in heterogeneous IoT systems. We offer constant-time low-level arithmetic with minimum instructions and memory usage, as well as performance- and memory-prior configurations, and exploit a parametric and generic design, and offer various trade-offs between resource and efficiency, yielding a solution suitable for accelerating RNS homomorphic multiplication on both high-performance and embedded GPUs. Through this, we can offer more real-time evaluation results and relieve the computational pressure on cloud devices. We deploy our implementations on two GPUs and achieve up to 378.4×, 234.5×, and 287.2× speedup for homomorphic multiplication of BGV, BFV, and CKKS on Tesla V100S, and 8.8×, 9.2×, and 10.3× on Jetson AGX Xavier, respectively
Accelerator for Computing on Encrypted Data
Fully homomorphic encryption enables computation on encrypted data, and hence it has a great potential in privacy-preserving outsourcing of computations. In this paper, we present a complete instruction-set processor architecture ‘Medha’ for accelerating the cloud-side operations of an RNS variant of the HEAAN homomorphic encryption scheme. Medha has been designed following a modular hardware design approach to attain a fast computation time for computationally expensive homomorphic operations on encrypted data. At every level of the implementation hierarchy, we explore possibilities for parallel processing. Starting from hardware-friendly parallel algorithms for the basic building blocks, we gradually build heavily parallel RNS polynomial arithmetic units. Next, many of these parallel units are interconnected elegantly so that their interconnections require the minimum number of nets, therefore making the overall architecture placement-friendly on the implementation platform. As homomorphic encryption is computation- as well as data-centric, the speed of homomorphic evaluations depends greatly on the way the data variables are handled. For Medha, we take a memory-conservative design approach and get rid of any off-chip memory access during homomorphic evaluations.
Our instruction-set accelerator Medha is programmable and it supports all homomorphic evaluation routines of the leveled
fully RNS-HEAAN scheme. For a reasonably large parameter with the polynomial ring dimension 214 and ciphertext coefficient modulus 438-bit (corresponding to 128-bit security), we implemented Medha in a Xilinx Alveo U250 card. Medha achieves the fastest computation latency to date and is almost 2.4× faster in latency and also somewhat smaller in area than a state-of-the-art reconfigurable hardware accelerator for the same parameter
RISE: RISC-V SoC for En/decryption Acceleration on the Edge for Homomorphic Encryption
Today edge devices commonly connect to the cloud to use its storage and
compute capabilities. This leads to security and privacy concerns about user
data. Homomorphic Encryption (HE) is a promising solution to address the data
privacy problem as it allows arbitrarily complex computations on encrypted data
without ever needing to decrypt it. While there has been a lot of work on
accelerating HE computations in the cloud, little attention has been paid to
the message-to-ciphertext and ciphertext-to-message conversion operations on
the edge. In this work, we profile the edge-side conversion operations, and our
analysis shows that during conversion error sampling, encryption, and
decryption operations are the bottlenecks. To overcome these bottlenecks, we
present RISE, an area and energy-efficient RISC-V SoC. RISE leverages an
efficient and lightweight pseudo-random number generator core and combines it
with fast sampling techniques to accelerate the error sampling operations. To
accelerate the encryption and decryption operations, RISE uses scalable,
data-level parallelism to implement the number theoretic transform operation,
the main bottleneck within the encryption and decryption operations. In
addition, RISE saves area by implementing a unified en/decryption datapath, and
efficiently exploits techniques like memory reuse and data reordering to
utilize a minimal amount of on-chip memory. We evaluate RISE using a complete
RTL design containing a RISC-V processor interfaced with our accelerator. Our
analysis reveals that for message-to-ciphertext conversion and
ciphertext-to-message conversion, using RISE leads up to 6191.19X and 2481.44X
more energy-efficient solution, respectively, than when using just the RISC-V
processor
HEAX: An Architecture for Computing on Encrypted Data
With the rapid increase in cloud computing, concerns surrounding data
privacy, security, and confidentiality also have been increased significantly.
Not only cloud providers are susceptible to internal and external hacks, but
also in some scenarios, data owners cannot outsource the computation due to
privacy laws such as GDPR, HIPAA, or CCPA. Fully Homomorphic Encryption (FHE)
is a groundbreaking invention in cryptography that, unlike traditional
cryptosystems, enables computation on encrypted data without ever decrypting
it. However, the most critical obstacle in deploying FHE at large-scale is the
enormous computation overhead.
In this paper, we present HEAX, a novel hardware architecture for FHE that
achieves unprecedented performance improvement. HEAX leverages multiple levels
of parallelism, ranging from ciphertext-level to fine-grained modular
arithmetic level. Our first contribution is a new highly-parallelizable
architecture for number-theoretic transform (NTT) which can be of independent
interest as NTT is frequently used in many lattice-based cryptography systems.
Building on top of NTT engine, we design a novel architecture for computation
on homomorphically encrypted data. We also introduce several techniques to
enable an end-to-end, fully pipelined design as well as reducing on-chip memory
consumption. Our implementation on reconfigurable hardware demonstrates
164-268x performance improvement for a wide range of FHE parameters.Comment: To appear in proceedings of ACM ASPLOS 202
- …